Preface


Intelligent and knowledge-oriented technologies currently affect many areas of human
life. They form an important part of the research activity of several research
groups active in Slovakia and the Czech Republic. They were also the subject of
presentations and discussions at Smolenice Castle, where the 11th Workshop on
Intelligent and Knowledge-oriented Technologies (WIKT) and the 35th Conference on
Data and Knowledge (Data a znalosti, DaZ) were held on 3 and 4 November 2016.

This year continued the tradition started in 2006 (the first WIKT workshop) and in
1981 (the first DATASEM conference, the predecessor of Data a znalosti). In recent
years the series of workshops and conferences has fostered a creative research
environment by providing a forum for exchanging knowledge and for creative discussion
in the field of intelligent and knowledge-oriented technologies in Slovakia and the
Czech Republic. The aim of WIKT & DaZ has always been to bring together researchers
from several research centres in Slovakia, the Czech Republic and the surrounding region.

The main topics of the WIKT workshop were:

- knowledge technologies and their applications, 

- information and knowledge modeling, semantic representation, 

- analysis and processing of information sources, 

- social web and its applications, analysis of social networks, 

- personalized web and its applications, recommendation, 

- processing of information sources in Slovak language, 

- semantic and service oriented architecture, 

- reasoning and inference. 

The main topics of the Data a znalosti conference were:

- data mining, 

- machine learning, classification and prediction systems, 

- creation, publication and employment of open data, 

- indexing and retrieval of text and multimedia data,

- user modelling, adaptive and personalized systems,

- advanced user interfaces of software and information systems, 

- systems for knowledge management in organizations, 

- expert, intelligent and agent systems, 

- natural language processing applied to real tasks,

- ontologies and conceptual models applied to real tasks,

- automated reasoning and planning applied to real tasks.





Authors sent their contributions in the form of extended abstracts (in Slovak, Czech and 
English) of the following types: 

- research paper,

- work-in-progress paper,

- application paper,

- position paper,

- PhD symposium (a special call for doctoral students, who could offer a
contribution related to the direction and goals of their dissertation).

The WIKT workshop and the Data a znalosti conference once again reaffirmed their
significance this year. A total of 52 papers were submitted, most of them as research
papers or work-in-progress papers. Each contribution was reviewed by three members of
the program committee. As a result of the assessment, 50 papers were accepted. All
papers were presented in a lively format of short presentations followed by poster
discussions. Ten of them were accepted for a longer presentation, 26 for a short
announcement, and 14 for the PhD symposium.

Following the DaZ conference tradition, 8 invited lectures on interesting topics in
information processing were presented, including four by experts from industry.

The majority of the papers are written in the native language of the authors, i.e., in
Slovak or Czech. The languages of the workshop were Slovak and Czech. On the one hand
this limits the dissemination of the results, but on the other hand it helps to develop
professional language skills in the domain of rapidly developing information,
knowledge and web technologies.

We also continued the good tradition of organizing a project meeting just before the
conference. The meeting of the HIBER project (Human Information Behavior in the Digital
Space) was held on 2 November 2016. Thirty-two researchers from the Faculty of
Informatics and Information Technologies, Slovak University of Technology in Bratislava,
and the Faculty of Arts, Comenius University, discussed and brainstormed in groups on
the initial directions of the project.

We are very pleased that this year Smolenice Castle hosted a record number of
participating research groups from Slovakia and the Czech Republic. We thank all authors
for their interesting contributions, which initiated fruitful debates. We thank the
members of the program committee, who willingly took part in reviewing the submissions
and in discussions about the direction of the workshop. We also thank them for helping
to maintain the high professional level of the event and for coming to the workshop
with their research groups.

We especially thank Ondrej Kassak for preparing these proceedings, and all the
members of the organizing committee, who made a considerable effort to turn a
picturesque spot in the heart of Central Europe into a centre of passionate scientific
debate for two days and helped to spread knowledge and collaboration.

Bratislava, October 2016 

Maria Bielikova and Ivan Srba 



Foreword


Intelligent and knowledge-oriented technologies currently influence the most diverse
areas of human activity. They also form a significant part of the work of several
research groups active in Slovakia and the Czech Republic. They were also the main
topic of the presentations and discussions at Smolenice Castle on 3 and 4 November 2016,
where the 11th Workshop on Intelligent and Knowledge-oriented Technologies, WIKT 2016,
was held jointly with the 35th Data a znalosti conference.

This year continued the tradition begun in 2006 (the first WIKT workshop) and in 1981
(the first DATASEM conference, the predecessor of the Data a znalosti conference).
Over recent years the series of workshops and conferences has created a creative
environment that supports research, mainly through the exchange of knowledge and
creative discussions in the attractive areas of intelligent and knowledge-oriented
technologies in Slovakia. The aim of the WIKT workshop and the Data a znalosti
conference has always been to connect researchers from several research centres across
the wider region of Slovakia and the Czech Republic.

The main topics of the WIKT 2016 workshop were:

- knowledge technologies and their applications,

- information and knowledge modelling, semantic representation,

- analysis and processing of information sources,

- social web and its applications, analysis of social networks,

- personalized web and its applications, recommendation,

- processing of information sources in the Slovak language,

- semantic and service-oriented architectures,

- reasoning and inference.

The main topics of the Data a znalosti 2016 conference were:

- data mining,

- machine learning, classification and prediction systems,

- creation, publication and use of open and linked data,

- indexing and retrieval of text and multimedia data,

- user modelling, adaptive and personalized systems,

- advanced user interfaces of software and information systems,

- systems for knowledge management in organizations,

- expert, intelligent and agent systems, computational intelligence,

- computational linguistics applied to real tasks,

- ontological and conceptual models applied to real tasks,

- automated reasoning and planning applied to real tasks.




Authors submitted contributions in the form of an extended abstract, in Slovak, Czech
or English, in the following categories:

- research paper,

- work-in-progress paper,

- application paper,

- visionary paper,

- doctoral symposium (a special call for third-cycle students around the time of their
dissertation examination, who could offer a contribution on the direction and goals of
their dissertation).

The WIKT workshop and the Data a znalosti conference again confirmed their relevance
this year. In total 52 contributions were submitted, most of them in the research paper
or work-in-progress paper category. Each contribution was reviewed by three members of
the programme committee. The outcome of the review was the decision to accept 50
contributions. The authors presented all contributions in a lively discussion at
posters, which was preceded by a brief announcement of the results. Ten of them were
accepted for a longer announcement and 26 for a short announcement. Fourteen
contributions were accepted into the doctoral section, where interesting discussions
took place in four sessions.

In keeping with the tradition of the Data a znalosti conference, the programme included
8 invited lectures on interesting topics in information processing (the speakers
included four experts from industry).

Most of the contributions are written in the authors' native language, i.e. Slovak or
Czech. The languages of the workshop were Slovak and Czech, which on the one hand limits
the dissemination of results, but on the other hand helps to cultivate professional
language in the domain of rapidly developing knowledge and web technologies.

We also continued the good tradition of organizing a project meeting just before the
conference itself. On Wednesday, 2 November 2016, a meeting on the HIBER project
(Human Information Behavior in the Digital Space) took place, attended by researchers
from the Faculty of Informatics and Information Technologies of the Slovak University
of Technology in Bratislava and from the Faculty of Arts of Comenius University. The
project is just beginning, so the main goal was to discuss its concrete direction.

Interest in the WIKT workshop and the Data a znalosti conference was at a record level
this year. We are very pleased that a record number of research groups were present at
Smolenice Castle, specifically 12 institutions with more than one participant. We thank
all authors for their interesting contributions, which stimulated discussion. We thank
the members of the programme committee, who willingly took part in reviewing the
contributions and in discussions about the direction of the workshop, and who also
helped to maintain the high professional level of the whole event by coming to the
workshop with their research groups.

We also thank Ondrej Kassak for his perfect work in preparing these proceedings, and
all members of the organizing committee, who made a considerable effort so that for two
days a picturesque spot in the heart of Central Europe was transformed into a place of
passionate scientific debate, thereby helping to spread knowledge and cooperation.


Bratislava, October 2016


Maria Bielikova and Ivan Srba
The Development of Databases and Its Reflection in the DATASEM,
DATAKON and Data a znalosti Conferences, 1981 - 2016


Jaroslav Pokorny

MFF UK, Malostranske nam. 25
Praha, Czech Republic

pokorny@ksi.mff.cuni.cz 


Abstract. Fifty years of database development is already a respectable number. We
registered it in Czechoslovakia as well, with a delay of roughly ten years. The
Czechoslovak professional community quickly became involved in this attractive field.
The historically oldest professional meetings, under the name DATASEM (DATabazovy
SEMinar), began as early as 1981. Characteristic of these first conferences was a very
fruitful symbiosis of experts from theory and practice: besides academics, the meetings
were also well attended by representatives of the commercial sphere. The aim of this
article is to show how the worldwide development of databases has been, and still is,
reflected in these national professional meetings, i.e. in the twenty years of the
DATASEM seminar (later conference), which continued for a further fourteen years as
DATAKON and finally, since 2015, as the Data a znalosti conference.

Contribution type: Invited lecture

Keywords: database, database system, relational data model, web, heterogeneous
data sources, object-oriented databases, object-relational databases,
ontologies, XML, NoSQL, NewSQL


1 Introduction

Since their inception, the position of database systems (DBS) in computer science has
revolved around two basic problems:

• how to store data efficiently in external memory,

• how to formulate queries efficiently over data stored in this way.

Further problems were grafted onto these, such as database design, the incorporation of
databases into an organization's information system (IS), integration with data on the
web, and so on. As history shows, work on suitable software went hand in hand with the
development of database theory, specialized journals, professional conferences, and the
teaching of databases at most schools concerned with computer science.

Standalone database conferences in Czechoslovakia apparently began in the early 1980s
with the DATASEM '81 seminar. For many years the proceedings of this seminar were
published by Dum techniky CSVTS Praha. However, one of the first papers [11] to
introduce the language Sequel (later SQL) had been presented at the now legendary
SOFSEM seminar as early as 1978.

Among other events devoted to databases, the conference Moderni databaze (Modern
Databases) deserves mention; it existed from 1986, with several pauses, until 2012.
Whereas DATASEM and its successors served mainly the academic sphere in their
presentations and were also attended by participants from the commercial sphere, with
Moderni databaze it was the other way round. The main talks there were given by firms
and companies selling or developing database software, while speakers from universities
tended to present current trends and overviews of database technologies. Among other
related conferences we can name Objekty and Systemova integrace, where databases always
formed only part of a broader subject area.

Turning to DATASEM, from its 15th year it was called a conference. From 2001 until 2014
it continued under the new name DATAKON. The following year the DATAKON and Znalosti
conferences were merged under the new name Data a znalosti. Recall that the Znalosti
conference had existed since 2001 and was oriented mainly toward practically usable
tools, data sources and applications in the field of knowledge technologies. It is
telling, however, that until 2001 knowledge-related topics, together with e.g. expert
systems, also appeared on the DATAKON programme (see e.g. the paper Baze znalosti a
databaze z hlediska expertnich systemu (Knowledge bases and databases from the
viewpoint of expert systems) by P. Jirku in 1985 and further papers by P. Bartos and
P. Hajek in 1986).

Recall also that both conferences were always organized in cooperation between the
Czech and Slovak professional communities, in some years even with international
participation.

In 2005 it was emphasized in [17] that DATAKON was no longer an exclusively database
conference. That was natural. The differences between individual disciplines were
increasingly blurring. A good example is data processing in the web environment, where
databases, logic, artificial intelligence, natural language processing and other fields
meet. DATAKON did not shy away from these trends. Support for interdisciplinary
communication is also evident from the most recent development, the creation of the
Data a znalosti conference.

The aim of this contribution is to present, in a popular form, the history of the
DATASEM and DATAKON conferences and then of their latest variant, Data a znalosti. A
historical view of the conference has been presented several times in the past: once in
2000 on the occasion of its 20th anniversary [14], and then in 2005 [17], when we even
tried to find a correlation between the content of these conferences and database
trends in the world. After a further 11 years, this contribution offers a more modest
overview of these correlations without a deeper analysis of the papers. Section 2
mentions some of the history of databases in Czechoslovakia against the background of
their development in the world up to roughly the beginning of the 1990s. Section 3
highlights two basic directions in the development of databases in the 1970s and 1980s:
the development of SQL and the influence of object-oriented programming. Section 4
briefly describes the 1990s from the viewpoint of the integration of objects and tables
in the object-relational data model and the integration of heterogeneous data. The
relationship between databases and the web is also mentioned. Section 5 is devoted to
the transition into the third millennium, with the arrival of XML databases and the
gradual influence of the Big Data phenomenon on database technology. In Section 6 we
outline some directions of database development and the outlook for the future of the
Data a znalosti conference.
2 History of databases in Czechoslovakia

In [20] we presented the aptly formulated truths of the SQL expert Joe Celko, which
supplement a classic quotation from the English poet T. S. Eliot, who says:

Where is wisdom?

Lost in knowledge.

Where is knowledge?

Lost in information.

And J. Celko continues:

Where is information?

Lost in data.

Where are the data?

Lost in databases.

These quotations suggest that it is a long way from data to knowledge, let alone to
wisdom. Database technologies do not even attempt it; these goals are reserved rather
for knowledge formalization, the semantic description of the web, and artificial
intelligence.

The history of databases is sufficiently well known from a number of database
textbooks, especially foreign ones, such as the books [5], [21], [6], [22], or the more
theoretical [8]. In Central and Eastern Europe, however, their development was somewhat
more specific. It was typical that in Czechoslovakia the practical use of DBS always
lagged a little. The same was true at the level of relevant information, especially in
academia, where the lack of professional literature and the very limited opportunity to
attend foreign conferences played their negative role. Nevertheless, we first saw the
first book on databases, by C. J. Date from 1976, in its Russian edition sometime in
1977. The translation of the book Database systems by D. C. Tsichritzis and
F. H. Lochovsky, written in 1977, was however not published in Czechoslovakia until
1987 [24].

Leaving aside so-called batch data processing, which used file techniques and data
indexing directly, database technology has its roots in the network database model.

In 1965 the Conference on Data Systems Languages (CODASYL for short) was being formed.
Within this conference a committee known as the Database Task Group (DBTG) was created,
whose task was to develop, through a standardization process, the concept of a database
system (DBS). Network database management systems (DBMS) such as IDMS emerged, known in
our country from the mainframe era. In fact, as early as the beginning of the 1960s the
DBMS IDS was being developed under the leadership of Ch. Bachman, and it significantly
influenced the work of the DBTG committee. IDS is considered the very first database
software.

In 1971 the committee issued "The DBTG April 1971 Report", which introduced now
well-known database concepts such as database schema, schema definition language and
subschema, as well as the overall architecture of a network database system.

Almost in parallel, hierarchical databases were developing, which use not special
graphs of record types but only trees (hierarchies). Unlike network databases,
hierarchical databases have no standard. The history of the hierarchical data model is
inseparably linked with the DBMS IMS (Information Management System). Both IDMS and IMS
were also used in Czechoslovakia from the end of the 1970s. A look into the DATASEM '81
proceedings shows that papers reported not only on IDMS but also on SESAM, a special
variant of the network model on Siemens computers, and on SOFIS, a domestic database
product with COBOL-style data structures (developed at Vyzkumne vypoctove stredisko in
Bratislava).
It is telling that relational databases appeared in our country much later, only in the
1980s. After E. F. Codd introduced the relational data model (RDM) [4], it took almost
10 years before relational database technology developed to the point where its DBS
performance in a real environment was comparable with that of its network and
hierarchical counterparts of the time. Let us recall here IBM's pioneering
implementations from the 1970s, such as System R (the predecessor of today's DBMS DB2)
and QBE from 1978. The third implementation was INGRES from the University of
California. Among commercial relational products of the 1980s, the pioneers include
Oracle, Sybase, RDB (from DEC), Informix and Unify.

With the development of database technology, approaches to relational database design
were naturally developed as well. E. F. Codd did not say how to design the tables. His
goal was for them to be in third normal form, which was not always easy for a designer
analysing a given application domain. An important result was the E-R model, which
P. Chen published in 1976 [3]. The E-R model made conceptual modelling possible, with a
systematic approach to the resulting design of relation schemas. Although the E-R model
has many opponents, this concept in its numerous variants is today a de facto standard
in the world of structured design methodologies, not only for databases but also for
more general systems. Object-oriented tools are built on it as well.

Conceptual modelling had a deep tradition in our country too, as the DATAKON '81 and
'82 seminars demonstrate. The HIT database model, based on a simple theory of types and
the typed lambda calculus, was presented there [25]. In 1985 HIT's functional approach
even made its way to the prominent database conference VLDB [26]. A chapter on
conceptual modelling [12] was included in the translation of the book [24]. Alongside
conceptual modelling, functional analysis was also developing, i.e. the DBMS is
considered in the context of an IS.

Relational theory developed faster, and today it forms the foundation of database
theory as a whole. Basic concepts such as relational algebra and relational calculus,
normal forms, and transaction theory are finding their way, in modified form, into
other models as well. The first more detailed domestic book on databases was published
in 1992 [13].

3 Relations and objects

3.1 SQL 

M. Stonebraker, the creator of object-relational technology at INFORMIX, once declared
that SQL is an intergalactic query language. This is all the more true today.
Everything adapts to SQL. The beginnings of SQL go back to 1974, when it was still
called Sequel and focused mainly on its query part. Its prototype implementation was
part of System R, developed at IBM in San Jose, where E. F. Codd was also employed.

Since the first standard, called SQL86, the milestones SQL89, SQL92, SQL:1999,
SQL:2003, SQL:2006, SQL:2008 and SQL:2011 have appeared along the path of SQL's
development. Two things are worth noting. The first is the gap of many years between
the 1992 standard and the 1999 standard, which is due to the fact that SQL, based on
the RDM, was absorbing object extensions. The second is that between 1999 and 2006 SQL
was in turn absorbing the XML data model. Nor should we forget the SQL/MM part of the
standard from 2003, which contains extensions for text, spatial objects, images and
data mining.
The parts of the SQL standard are numbered from 1 to 14. It is worth mentioning that
parts 5, 6 and 8 do not exist, that development of part 7, SQL/Temporal, is temporarily
suspended (it is partially implemented in ORACLE 11g, IBM DB2 for the z/OS operating
system, and Teradata 13.10), and that development of part 12, SQL/Replication, was
cancelled. The current standard is still SQL:2011, which offers, for example, a
statement for "switching off" integrity constraints. It also includes support for
temporal databases, which, however, differs from the original approach of the cancelled
part 7.

3.2 Object orientation

The history of object-oriented (OO) DBMSs, conceived in the second half of the 1980s
and identified by the Manifesto of the Altair group in 1990, is somewhat more modest.
They were inspired by object-oriented programming and by object-oriented analysis and
design methodologies. The idea suggested itself of storing objects in a database while
also exploiting many useful features of OO technology. Another important reason for
using OO was the fact that relational DBMSs were not suitable for all applications.
Representative examples include problems of modelling objects in design systems (e.g.
CAD) and geographic IS. The uncontrolled development of OODBMSs was halted by the
de facto standard ODMG-93 and its subsequent versions [2]. It was adopted both by
OODBMS vendors and by the creators of CASE-type support tools for database design.
OODBMSs were reported on in more detail at DATAKON in 1992 (papers by J. Pokorny,
A. Bicher and J. Valenta).

At present there are around 20 OODBMSs¹. Despite initial optimism, it turned out that
the number of OODBMS deployments did not grow as quickly as expected. The functionality
and performance of these systems also remain at a relatively low level. The solution
adopted in subsequent years by the leading relational database companies lies rather in
object-relational DBMSs (ORDBMS), which combine the properties of relational DBMSs with
the benefits of OODBMSs.

4 Databases in the 1990s

The 1990s were marked by efforts to integrate heterogeneous data within the enterprise
and to extend DBMSs with further data types. In other words, the goal was to store
everything in the database, the possible and the impossible alike. One way of solving
these problems technically was to extend SQL's relational tables with objects. DBMS
vendors began to consider non-trivial large objects such as text, audio and video. For
these objects it was necessary to develop new query languages, which enabled new types
of queries (e.g. find the nearest neighbour of an object in space) and also led to a
rethinking of querying as such. So-called universal servers emerged, with new data
types added ad hoc.
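The nearest-neighbour query mentioned above can be illustrated with a minimal Python sketch. It is not taken from any particular DBMS: the function name `nearest_neighbour`, the use of Euclidean distance and the sample points are assumptions made only for illustration, and a real spatial server would use an index rather than a linear scan.

```python
import math


def nearest_neighbour(query, points):
    """Return the stored point closest to `query` (Euclidean distance).

    Linear scan for illustration only; the function name and the
    distance metric are assumptions, not part of any DBMS API.
    """
    if not points:
        raise ValueError("points must not be empty")
    return min(points, key=lambda p: math.dist(query, p))


# Hypothetical spatial objects stored as (x, y) coordinates.
stored = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.5)]
print(nearest_neighbour((1.0, 1.0), stored))
```

Unlike an exact-match predicate in classical SQL, the result here is defined by a ranking over all stored objects, which is why such queries prompted a rethinking of querying as such.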

Integrating enterprise data "at scale" led to a number of architectures based on the
original ideas of distributed databases addressed in the 1980s. It was in fact a
bottom-up approach to building a distributed database, based on (manual) integration of
the component database schemas. Considerable effort was devoted to resolving semantic
conflicts between the data of several databases and to transactions spanning multiple
databases. Incidentally, the failure of these architectures within enterprise IS
eventually led to the development of data warehouses (DW).


1 https://en.wikipedia.org/wiki/Comparison_of_object_database_management_systems 
Data integration in a DW is carried out by "pumping" the required data out of the
operational databases, cleaning them, and storing them in a special database.
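The three steps of warehouse loading (pump out, clean, store) can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the operational databases are plain lists of dictionaries, and the function names `extract`, `clean` and `load` as well as the field names `order_id`, `customer` and `amount` are hypothetical.

```python
def extract(sources):
    """Pump rows out of every operational database (here: lists of dicts)."""
    for rows in sources:
        yield from rows


def clean(rows):
    """Drop incomplete rows and duplicate order ids, normalise the values."""
    seen = set()
    for row in rows:
        if row.get("amount") is None:
            continue  # incomplete record, not loaded
        if row["order_id"] in seen:
            continue  # same order already delivered by another source
        seen.add(row["order_id"])
        yield {
            "order_id": row["order_id"],
            "customer": row["customer"].strip().title(),
            "amount": float(row["amount"]),
        }


def load(rows, warehouse):
    """Store the cleaned rows in the special (warehouse) database."""
    warehouse.extend(rows)
    return warehouse


# Two hypothetical operational databases with overlapping, dirty data.
sales_db = [{"order_id": 1, "customer": " alice ", "amount": "10.5"}]
billing_db = [
    {"order_id": 1, "customer": "Alice", "amount": "10.5"},
    {"order_id": 2, "customer": "bob", "amount": None},
]

warehouse = load(clean(extract([sales_db, billing_db])), [])
print(warehouse)
```

The point of the sketch is the direction of the data flow: the operational sources are read but never modified, and conflicts are resolved once, at load time.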

4.1 Objects, relations with objects

The golden age of OODBMSs was the 1990s (see also the DATASEM '93 and '94 volumes).
Although there were opinions that OODBMSs would completely displace relational systems,
this did not happen. On the contrary, relational systems adapted to object ones. The
goal was to extend the relational data model with objects [23]. The object-relational
(OR) database model emerged, represented by the SQL:1999 standard.

OO and OR do converge, but mostly only in the part concerning querying. Both OO and OR
modelling are characterized above all by the richness of the available object types and
by extensibility with further types. Traditional relational databases allowed such a
world to be modelled simply, but at the cost of often complicated and inefficient
access to the corresponding data.

ORDBMSs try to bridge the gap between relational technology and OODBMSs. They add the
ability to store objects in a relational database. By encapsulating methods and data
structures, an OR server can invoke complex operations for searching and transforming,
for example, complex multimedia data. This in effect solves the problem of universal
servers. The difficulty, however, has always been and remains the implementation of
such an approach. Let us also not forget that relational functionality (querying,
updating, etc.) is still part of ORDBMSs too, i.e. relations (tables) remain the basic
objects.

Thanks to the object extension of SQL, ORDBMSs seemed a promising link in the
development of database technology. After more than 15 years of existence, however,
like OO technology before them, they have not achieved a spread in applications
comparable to that of purely relational DBMSs.

4.2 Integration of heterogeneous data

The problem of integrating heterogeneous data has been addressed continually throughout
database history. The 1980s and 1990s offered a number of techniques for integrating
heterogeneous data. These include mainly the global-schema approach, federated
databases and multidatabases. Most of these systems represented a static solution that
does not hold up in a dynamic environment, where the individual databases needed to
evaluate a given request are not even known in advance.

Of these approaches, the federation architecture probably offered the least static
solution. In the Internet environment, however, it is desirable to integrate
unstructured or semi-structured data as well, with querying based on looser principles
than, for example, SQL. At the end of the 1990s a new data model and language, XML,
appeared for describing semi-structured data. Its standard² became another milestone on
the journey through database history.

With web services came imprecise search, as it had been used for years before in
document retrieval systems (already at DATASEM '84). A user wants to find documents
"similar" to the one currently being studied, or perhaps some manufacturers of baby
strollers in a certain price category, regardless of whether the result contains all of
them, as a relational system based on SQL would offer, or is even sorted according to
some user priorities.


2 https://www.w3.org/TR/REC-xml/


Tab. 1 Categories of papers and their counts (#p = number of papers)

Category                                     1981-2000   2001-2005   2006-2015
                                                #p          #p          #p
DB models                                       32          13           7
Ontologies                                     NULL        NULL          3
NoSQL, Big Data                                NULL        NULL          7
Foreign DBMS                                    29           1           0
Domestic DBMS                                   17           0           0
Distributed DBMS                                22           0           4
Database theory                                  7           4           2
DBS architectures                               20           6           3
IS design                                       70          12          14
Query languages                                 19           9           1
Text databases, text processing on the Web      20           5          14
Networks, Internet                               7         NULL        NULL
Networks                                       NULL          3           0
Web, XML, Open Data                            NULL         16          26
Physical data structures, DBS operation         15           9           1
Artificial intelligence                         34           9         NULL
Data mining, analytics                         NULL        NULL         13
Applications                                    15          12          16
Survey papers (tutorials)                       12           0          27
Security                                        11          15          10
IS/ITC management                               17           3          10
Other                                           20           6          24
Total                                          367         123         182


Table 1 (parts taken from [15], [14]) gives the numbers of papers at the DATASEM and
DATAKON conferences, divided by topics that were clearly identifiable and reflected
significant directions (certainly not all of them!) of development in the world. The
last column already represents the first 5 years of the third millennium. NULL
indicates that the category is not defined for the given period.

5 Entering the third millennium


The turn of the millennium was, in database technology too, marked by the Internet and
the web. The web provides a simple and universal standard for information exchange.
After 2000, activities in XML technology took off intensively, especially in the
development of XML databases. Further development, in connection with the growing Big
Data phenomenon, headed toward so-called NoSQL databases (the name was used for this
software in 2009). Recall that the term Big Data with its well-known characteristics
appeared in 2001 in [7]. From the web perspective, the notion of the semantic web as a
database was developing (see e.g. the paper by Guttner and Hruska at DATAKON 2003), as
were papers on RDF databases in later years.

5.1 XML databases

With the arrival of XML came a new database model and new query languages or, more
generally, manipulation languages. XML databases emerged.

There are two basic architectures for storing XML data in a database: databases that
provide access to XML data stored, for example, in a relational DBMS, and native XML
databases. A look at the list that R. Bourret maintained on his web pages [1] until
2010 shows that at that time there were 24 commercial products of the first kind and 39
products of the second kind.

In 2003 the SQL:2003 version of the standard appeared, integrating the XML data model
into the relational environment. SQL:2006 brought the full integration of XML into SQL,
including the XQuery language. SQL extended with XML is called SQL/XML³. It forms
part 14 of the standard.

DATAKON responded to the development of XML databases with the paper [16], which became
one of the basic sources for the Czech book on XML technologies [10] published in 2008.

5.2 Toward big data

Turning to the concrete problems that influence today's new database technologies,
there are two fundamental ones: the size of databases and the heterogeneity of
databases.

In applications we are reaching units such as the petabyte and the exabyte. Even the
zettabyte (10^21 bytes) is realistic in areas such as Earth research data or video and
audio archives.

Po mnoho let se ve vyvoji IS spolehalo na vertikalni skalovani, tj. investovalo se do 
novych a drahych velkych serveru. Bohuzel, tento pristup pouziti architektury sdileni- 
niceho vyzaduje vyssi uroven dovednosti a neni v nekterych pripadech spolehlivy. Po 
prerozdeleni dat za provozu muze napr. klesnout vykon systemu. Rozdelovani databaze 
mezi vice (levnych) stroju pridavanych dynamicky, tzv. horizontalm skalovani, muze 
patrne zajistit skalovatelnost efektivneji a levneji. Nez prizpusobovat bezne SRBD pro 
horizontalm skalovani, zda se, ze dnesni hojne citovane NoSQL databaze navrzene pro 
levny hardware a vyuzivajici rovnez architekturu sdileni-niceho mohou byt v nekterych 
pripadech fesenim jeste lepsim. Krome cloud computingu se NoSQL databaze uplat- 
nuji v aplikacich Web. 2.0 a v socialnich sitich, kde horizontalm skalovani zahrnuje 
tisice uzlu. Neni nahoda, ze nejvlivnejsi NoSQL databaze pochazeji z vyvojovych dilen 
firem Google a Amazon. DATAKON 2011 reagoval na tyto trendy v r. 2011 prispev- 
kem J. Pokomeho [18] a v r. 2014 prispevkem [19]. 
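
To make the idea of horizontal scaling on shared-nothing nodes more concrete, the following is a minimal Python sketch of hash-based partitioning of keys across machines. The node names and the helper functions `shard_for`, `partition` and `moved_keys` are hypothetical illustrations and not the interface of any system named in this paper. The sketch also counts how many keys change their node when a machine is added, which is one reason why redistributing data during operation can be costly.

```python
import hashlib


def shard_for(key: str, nodes: list[str]) -> str:
    """Pick a node for a key by hashing the key modulo the number of nodes."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def partition(keys: list[str], nodes: list[str]) -> dict[str, list[str]]:
    """Assign every key to exactly one node (shared-nothing: no key is stored twice)."""
    placement: dict[str, list[str]] = {node: [] for node in nodes}
    for key in keys:
        placement[shard_for(key, nodes)].append(key)
    return placement


def moved_keys(keys: list[str], old_nodes: list[str], new_nodes: list[str]) -> int:
    """Count keys whose node changes when the cluster is resized."""
    return sum(1 for k in keys if shard_for(k, old_nodes) != shard_for(k, new_nodes))


# Hypothetical cluster: three cheap machines, later extended by a fourth.
keys = [f"user:{i}" for i in range(1000)]
three = ["node-a", "node-b", "node-c"]
four = three + ["node-d"]

before = partition(keys, three)
print({node: len(assigned) for node, assigned in before.items()})
print("keys relocated after adding a node:", moved_keys(keys, three, four))
```

With plain modulo hashing, most keys move when the number of nodes changes. Production systems usually limit this with more refined schemes such as consistent hashing, which this sketch deliberately leaves out.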

A significant edition of the DATAKON conference was DATAKON 2014, with the themes Big Data, Open Data and Linked Data. The invited talks "Big Data: their storage, processing and use" (J. Pokorny, MFF UK), "Big Data are far more than just 'big' data" (J. Slaby, IBM) and "Open and linkable data" (D. Chlapek, J. Kucera, FIS VSE, and M. Necasky, MFF UK) developed these themes in detail, naturally not only in the database context. From the Czech professional literature on Big Data, the book [9] can be recommended.


3 ISO/IEC 9075-14:2008: XML-Related Specifications (SQL/XML)


Tab. 2 New database architectures in the last 15 years

Milestone  Category  Subcategory         Representatives
2009       NoSQL     key-value           Redis 4
                     column-oriented     Cassandra 5
                     document-oriented   MongoDB 6
                     graph databases     Neo4j 7
2005       BDMS      1st generation      Hadoop software stack
2010                 2nd generation      Asterix software stack
2011       NewSQL    general             NuoDB 8 , VoltDB 9 , Clustrix 10
                     Google hybrids      Spanner 11
                     Hadoop-relational   Vertica 12 , HadoopDB 13
                     SQL-on-Hadoop       Hive 14
                     NoSQL with ACID     FoundationDB, MarkLogic 15 , OrientDB 16


5.3 New database architectures

NoSQL databases by themselves are suitable for certain applications that use big data, but practice has gradually called for more complex database architectures. Special DBMSs have even appeared, called Big Data Management Systems (BDMS). Chief among them is ASTERIX 17 , which uses special operations, e.g., fuzzy joins, for analytical purposes. ASTERIX is part of a larger software stack with entry points at different levels of the view of data, from the highest ones (the AsterixQL query language), through HiveQL, Piglet and others, down to jobs in Pregel (used for working with graphs) and Hyracks at the level of working with files. Similar efforts can also be seen at large database vendors such as Oracle. For example, Oracle Big Data Appliance combines Hadoop and NoSQL in a single SQL query. Table 2 shows the basic categories and subcategories of these new database architectures.


4 http://redis.io/

5 http://cassandra.apache.org/

6 https://www.mongodb.com/

7 https://neo4j.com/download/

8 http://www.nuodb.com/

9 https://voltdb.com/

10 http://www.clustrix.com/

11 https://www.infoq.com/presentations/spanner-distributed-google

12 http://www8.hp.com/us/en/software-solutions/advanced-sql-big-data-analytics/

13 http://db.cs.yale.edu/hadoopdb/hadoopdb.html

14 https://hive.apache.org/

15 http://www.marklogic.com/

16 http://orientdb.com/orientdb/

17 https://asterixdb.ics.uci.edu/

The term NewSQL databases has been in use since 2011. These are highly scalable and elastic relational DBMSs that

• are designed for horizontal scaling on shared-nothing machines,

• guarantee ACID properties,

• let applications interact with the database primarily through SQL (including joins),

• use a lock-free protocol for concurrency control,

• deliver higher performance than traditional relational systems.

General-purpose NewSQL systems include ClustrixDB, NuoDB (suitable for clouds) and VoltDB.

Interesting NewSQL architectures are Spanner and F1, both developed by Google. Spanner uses hierarchies of tables that are semi-relational, where every row has a name (i.e., a primary key always exists). F1 is an SQL DBMS built on top of Spanner.

Hadoop-relational hybrids include HadoopDB and Vertica. HadoopDB is a parallel database with Hadoop connectors that transforms SQL queries into MapReduce jobs. Vertica is an analytical DBMS integrated with Hadoop through two connectors that enable data transfer in both directions between HDFS and the system using MapReduce. The SQL-on-Hadoop category includes, e.g., Hive and its variants. Hive was the first SQL engine on Hadoop.
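
The translation of an SQL query into a MapReduce job can be illustrated with a small, self-contained Python sketch. It mimics in memory how a query such as SELECT model, COUNT(*) FROM dbms GROUP BY model could be split into a map, a shuffle and a reduce phase. The function names and the table name are hypothetical, the sample rows are taken from Table 3, and the sketch does not reproduce the actual query planner of HadoopDB or Hive.

```python
from collections import defaultdict
from typing import Iterable, Iterator

Row = dict[str, str]


def map_phase(rows: Iterable[Row]) -> Iterator[tuple[str, int]]:
    """Map: emit (grouping key, 1) for each input row."""
    for row in rows:
        yield row["model"], 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: collect all values emitted under the same key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: dict[str, list[int]]) -> dict[str, int]:
    """Reduce: aggregate each group; summing the ones yields COUNT(*)."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_model(rows: Iterable[Row]) -> dict[str, int]:
    """In-memory stand-in for SELECT model, COUNT(*) ... GROUP BY model."""
    return reduce_phase(shuffle(map_phase(rows)))


rows = [
    {"name": "Oracle", "model": "relational"},
    {"name": "MySQL", "model": "relational"},
    {"name": "MongoDB", "model": "document-oriented"},
    {"name": "Cassandra", "model": "column-oriented"},
    {"name": "Redis", "model": "key-value"},
]
print(count_by_model(rows))
```

In a real cluster the map and reduce functions run on different nodes and the shuffle moves data over the network; here all three phases run in one process purely for illustration.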

In the enterprise sphere, NoSQL systems with ACID properties are appearing, sometimes also called Enterprise NoSQL. These DBMSs retain a distributed design, fault tolerance, easy scaling and a simple, flexible database model. As for transaction processing, they are CP systems (C - Consistency, P - Partition tolerance) with global transactions, i.e., they do not in general guarantee availability. Examples include FoundationDB, a scalable key-value store, and MarkLogic, a document-oriented NoSQL database that uses the JSON format, HDFS and optimistic locking for data storage. OrientDB is a distributed DBMS for graph databases.


Tab. 3 Prominence of NoSQL in the database world, June 2016

Rank  DBMS                   Database model      Score
1     Oracle                 relational          1449.25
2     MySQL                  relational          1370.13
3     Microsoft SQL Server   relational          1165.81
4     MongoDB                document-oriented    314.62
5     PostgreSQL             relational           306.60
6     DB2                    relational           188.57
7     Cassandra              column-oriented      131.12
8     Microsoft Access       relational           126.22
9     SQLite                 relational           106.78
10    Redis                  key-value            104.49

The closing Table 3 shows part of a larger table of database product prominence from the well-maintained site DB-Engines 18 (which ranks 275 products). We see that the NoSQL systems MongoDB, Cassandra and Redis appear in the top ten.

6 Conclusion - or what next after 2015

Something new keeps appearing in the development of database technologies. For example, there is talk of Extreme Big Data (EBD), i.e., data approaching yottabytes (10^24 bytes) in size. How to store them and work with them certainly remains a challenge, especially at the level of their distribution and operation in a network. Another problem is how to choose a product or products for an application architecture whose goal is to integrate heterogeneous (big) data from different sources. The range of products with very different properties is wide, and designing and assembling the resulting architecture requires considerable experience and knowledge.

And what about cloud computing? In keeping with current ICT developments, we could continue in the style of Eliot and Celko [20]:

Where are the databases?

Lost in the cloud.

Indeed, neither the database nor the DBS architecture need be visible. For users this can be an advantage. On the other hand, implementing an efficient cloud is, and will remain, a challenge. In general, the issue is Big Data that are not only big but also heterogeneous.

And our conference? In the Czech and Slovak database community, the year 2015 brought a shift: two conferences were merged into one, named Data a Znalosti. The integration was the logical outcome of activities from 2013, when both conferences were first held at the same venue, although still separately. Unlike ordinary professional conferences, its program is based solely on invited talks and posters. In the first year of the conference, the originally purely database theme of Big Data was naturally accompanied by the themes of Big Analytics and advanced analytics. The invited talks covered the topics "Data quality management with regard to open and linkable data" (D. Chlapek, J. Kucera, FIS VSE) and "Visualization of big data" (J. Geryk, L. Popelinsky, FI MU).

Acknowledgement: This work was supported by project P46 of Charles University.

Annotation:

The database development and its reflection in conferences DATASEM, DATAKON and Data and Knowledge in years 1981 - 2016

Historically, the oldest professional meetings, called DATASEM (DATAbase SEMinar), began in 1981. A very fruitful symbiosis of experts from theory and practice was characteristic of these first conferences. This paper aims to show how the worldwide development of DB technology was, and still is, reflected in these specialized national meetings, i.e., at twenty seminars (later conferences) DATASEM, continued for the next fourteen years as DATAKON and finally, since 2015, as the conference Data and Knowledge.



The real needs of companies for collecting and processing external data - case studies from practice

Filip VITEK

CRM and BigData Solutions Department
mediworx software solutions, a.s.

Einsteinova 19, 851 01 Bratislava, Slovak Republic

filip.vitek@mediworx.sk


Abstract. The use of external data sources, often of a Big Data nature, has moved from the realm of theoretical possibilities into the first implementation projects, in the Central European context as well. In this contribution the author summarizes how to correctly uncover the information needs of companies and which pitfalls have been identified in collecting and processing the data. Using a set of concrete examples from the Slovak and Czech business environment, he documents the technological and information requirements for commercial use of this kind of service, as well as various ways of monetizing data collection and its use for business goals. In conclusion, the author outlines areas in which research and educational institutions could accelerate the development of this field.

Contribution type: Invited talk

Keywords: Big Data, commercial use, information needs, data monetization


1 Introduction

Technologies that make it easy to collect large data sets from publicly accessible Internet servers have become available to such a wide group of users that even the management of commercial companies no longer regards web crawling as science-fiction functionality or as research projects waiting for meaningful monetization.

Another important factor that increases the significance of web crawling in the commercial environment is that, thanks to widely popularized statistical and analytical software tools, the most progressive industries (banking, insurance, telecommunications) have, since 1995, reached the upper limit of what can be learned from their internal data. If companies want to keep advancing their information advantage over competitors, they are forced to turn their attention to external data sources.





Through the interplay of several reasons (explained later in the text), mass data extraction and Big Data 1 solutions form an attractive niche on the border between academic research and the commercial sphere. As is usual for tools that sit "in the middle", their successful use requires taking the specifics of both sides into account. The following sections discuss 1) how to correctly recognize opportunities where the commercial sphere shows demand for Big Data solutions, 2) which tools are most often used to carry out Big Data projects, and 3) how to monetize and price the benefits of the service for the commercial sector. The principles presented here grew out of the project plans of individual use cases whose implementation in the commercial sector in 2015-2016 the author either actively managed or was familiar with. The closing sections are addressed to the scientific community as a stimulus for further development of the Big Data field, of data services and of data products.

2 Big Data tools - services on the border between academic research and the commercial sphere

In recent years, academic and political leaders have paid very close attention to linking scientific research more closely with the commercial sector. From the commercial sector's side, this discussion most often turns to the need to align academic curricula with the real work tasks of employees. From the academic community's point of view, the most frequently discussed topic is the possibility of turning research work into commercially used projects, thereby improving the funding of higher education, raising the prestige of research teams through practical applications, and motivating researchers to carry out projects that leave the laboratory conditions of academia.

The degree to which research activity is converted into commercial use naturally varies from one scientific discipline to another. In most cases, however, one of the following applies:

- PUSH principle, where research offers a surplus of new approaches and discoveries (e.g., applied mathematics, materials research, ...). A frequent side effect of this approach is caution or even skepticism on the part of the commercial sector, since the innovations presented often require a radical change in production processes or in end-customer service. As a result, the marketing of solutions remains on the shoulders of academia.

OR

- PULL principle, where, conversely, the commercial sector actively issues calls for product innovations or new methods. As a rule, however, these concern areas that the research community itself has not yet fully explored. By the time initial research and experimental verification of hypotheses have been completed, years have often passed and the commercial sector has again turned its attention to other "hot" areas.


1 Although this term takes on alternative forms in various circles, its original definition comes from a META GROUP publication from 2001: https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf (viewed on 26.09.2016)

From this point of view, it is interesting that the area of mass data collection and analysis (referred to collectively in this contribution as Big Data projects) is unique in its balance of PULL and PUSH tendencies. Somewhat light-heartedly, one can say that not only are neither the commercial sector nor the scientific community able to seize the initiative for more intensive development of this topic on their own (which is often true of other disciplines as well), but, in the case of the disciplines involved in Big Data projects, a rare balance has arisen between the positions of the two sides. To take proper advantage of this promising opportunity for cooperation, we first turn our attention to the reasons that predestine this partnership.

2.1 Reasons for a partnership between academia and the commercial sector in developing Big Data

Although the 3V concept, the basic pillar of Big Data development, was codified by META GROUP analysts as early as 2001, in Central Europe the topic of Big Data applications first materialized in commercial-sector development projects only in 2011-2015. The main reason for this slow uptake was that the business benefits of Big Data solutions are most easily achieved by companies with large client portfolios (e.g., telecommunications, banking, insurance, energy). In these industries, however, the competitive environment in the Central European region was still developing between 2001 and 2009. Local players in these industries were therefore not forced (unlike their Western counterparts) to pay close attention to the topic and its development.

During the period when foreign markets began experimenting with Big Data approaches, their local counterparts bet on extracting knowledge from internal data. Most analytical effort went into aggregations and derived data from their own processes. Although these information [...]

Things were set in motion after the global financial crisis (2008-2010) subsided, when local companies were confronted with the fact that the Western players in their industries (with which they were often linked by capital) had already caught the wave of Big Data solutions. The promising economic results of the first Big Data projects broke the paradigm that Big Data is merely a theoretical concept with no economic effect 2 . As the economy globalizes, large local players in these industries faced a growing risk that Big Data services offered online could help smaller competitors become more aggressive in the market, leading to a loss of market share.


2 In professional circles, Big Data has even earned a disparaging comparison to teenage sex: "Everyone talks about how great it is, but nobody has actually tried it yet." Dan Ariely, In: Clau R., Big Data! Great! Now What? http://www.slideshare.net/ricardclau/big-data-great-now-what-symfonycon-2014, 2014. (viewed on 26.9.2016)

Another important factor creating the opportunity for a Big Data partnership is the technology gap that implementing Big Data projects brings. Because of the variety and volume of the data, traditional SQL storage formats cannot be used; it is necessary to reach for modern NoSQL 3 storage formats and open source. The problem on the commercial side, however, is that it has long relied exclusively on structured storage and proprietary software applications for data processing and analysis. The commercial sphere has therefore not built up human resources capable of working with other platforms. Academia, by contrast, has long been oriented (partly because of licensing costs) toward open-source software (R, Python, UNIX, MySQL, ...). From this point of view, one can conclude that academia is (paradoxically) technologically far better prepared to take on Big Data projects than the commercial sphere.

The technology gap described above also carries over into human resources. Since this topic has received attention in Central Europe (including in education) for only about 5 years, the education system has not yet managed to produce a sufficient number of graduates with the appropriate skills. Commercial entities, which face a serious shortage of such qualified workers, have no choice but to buy expensive human resources from abroad or to turn to academia and current students in the upper years of universities.

The last, but no less important, reason for the favorable climate for cooperation between academic research and commercial practice in Big Data is that most Big Data projects require the development of complex algorithmic constructions for extracting and subsequently analyzing texts and other unstructured products. Developing these algorithms requires (often time-consuming) systematic experimentation with different approaches. Academia can exploit a high degree of parallelism (term papers, multi-member research groups, etc.) when testing individual approaches, which is a considerable advantage over the relatively linear process of searching for a solution in the commercial environment.

The interplay of the above factors means that, in the Czech and Slovak environment, academia is much better prepared to succeed in developing Big Data solutions than the commercial environment, which is nevertheless exposed to sudden and intense demand for such solutions. These facts predestine Big Data (and the related scientific disciplines) as a suitable space for partnership between the scientific and commercial sectors in developing solutions, where growing demand from the commercial sector can be satisfied by the relatively well-positioned research capacities of academia.

A closer examination of these key factors behind the attractiveness of academic-commercial partnerships in Big Data suggests that their influence will not change fundamentally within a horizon of 3-5 years. The main aspect in this respect will be the availability of a qualified workforce for Big Data, which will be resolved in part by waves of graduates bringing the momentum needed to create internal Big Data teams in individual commercial companies. Within a 5-year horizon, one can then expect the importance of Big Data partnerships to decline, as this role is taken over by independent commercial entities acting as subcontractors of Big Data services.


3 The definition of the NoSQL approach and of the most widely used NoSQL database projects for Big Data projects is summarized at http://nosql-database.org/ (viewed on 26.9.2016)

The identified potential for cooperation can be turned into truly successful projects only if the joint project effort adheres to important principles for commercial use of Big Data, which are discussed in detail in the next section.

3 Reflecting the needs of the commercial sector in Big Data

3.1 How to recognize the Big Data needs of the commercial sector

Perhaps the most important aspect of a successful Big Data partnership between research teams and the commercial sector is focusing on the commercial sector's needs in this area. As in other cases, the commercial sector is willing to assign value to (and therefore finance) projects that directly affect one of the company's business priorities. Given the specific nature of the information benefits of Big Data projects, the commercial sector most often requests the following areas:

1. Acquisition of new clients. Some of the data traces that consumers leave on the web when using information portals or social networks make it possible to approach them directly with an offer of a specific service or product (e.g., if a client adds to an open, public social-network profile the information that they are looking for advice on choosing a holiday destination, they may be a valuable potential client for a travel agency or a portal that books tickets). The commercial sector often refers to generating a list of prospective clients as lead generation.

2. Profiling of own clients. Companies often have only limited information about the character and life context of their clients. Collecting additional public information about groups of clients (or directly about individual clients) from discussion forums, special-interest sites or social-network profiles can thus be a valuable tool for improving the targeting of marketing activities.

3. Monitoring competitors' business behavior. A considerable part of business today either takes place in online sales channels or at least presents the product range, price lists or other valuable business information online. Systematically collecting data about competitors gives a commercial entity an information advantage that it can turn into additional revenue or cost savings.

4. Profiling of external business partners. Just as information about a company's end customers can be collected from public parts of the web, external data can be used to profile suppliers or other important business partners. Data sources for this kind of Big Data project are usually public registers, industry sites or the websites of the profiled entities themselves.

5. Prevention of client churn. Just as it is important for a company to win new clients, it is valuable for the business to retain the clients it already has. Before a client decides to change the supplier of their services or goods, they usually first explore competitors' offers or use a price-comparison service. Detecting a client's interaction with the competition can therefore be a valuable trigger for the commercial entity to communicate with that client about whether the current form of products and services suits them. Suitable data sources for this kind of project are the social profiles or websites of competitors, or data from price-comparison services.

The choice of a specific benefit is usually also determined by the availability of data sources for the business motivators listed above. Companies often prefer promptly available, incomplete data, aiming to gain an information advantage first and only then to fine-tune the precision of data-driven recommendations. Before starting a specific partnership, it is advisable to examine data availability (and data quality) for each of the business benefits, since some data sources can provide inputs for several goals at once.

3.2 Most common technologies and tools for commercial projects

Even the best project plan is only a sketch unless the tools needed to implement the required functionality are available. Based on the author's work in a team specializing in Big Data projects, the most frequently requested tools in commercial Big Data projects are the following:

1. Web crawling, mass collection of external data. The vast majority of projects required, as one of the first steps, building an automated robot to collect data from defined sources. Technologically, these were usually small applications in Python, JavaScript or another web-oriented programming language. To exploit parallelism, research teams will need to master server virtualization as well as orchestration of larger numbers of servers using tools such as Chronos.

2. NoSQL storage for collected data. When collecting external data, most outputs take the form of text, images, audio or video. Structured (SQL) database storage is not sufficient for storing these data formats efficiently, and NoSQL storage must be provided. Technologically, these are primarily open-source products and solutions (to eliminate licensing costs) such as Cassandra, MongoDB or HDFS.

3. Text mining algorithms. Since the most common outputs of web crawling are text fields, perhaps the most important analytical components are tools for decomposing text, tagging it, and then matching text strings and establishing relationships between them. Although basic text-analysis packages (stemming, TF-IDF, n-grams, ...) provide a good start for most tasks, it will later be necessary to program custom routines as well, mainly to optimize the matching algorithms.

4. Semantic analysis. Alongside the functional analysis of text strings itself, there is often a requirement for semantic/emotional analysis of the data, which assigns to individual parts of the text a shade of the client's mood or attitude.

5. Machine learning algorithms. After text analysis, the second most common task in Big Data projects is association analysis (rule generation). For structured data, frequently requested components are autonomous machine learning algorithms for classifying objects or computing probabilities (decision trees, neural networks). This second group of algorithms, however, also requires detailed knowledge of the business environment and is therefore less often offered to external Big Data partnerships.

6. Specific non-text analyses. More sophisticated Big Data projects usually go beyond text analysis and will require more complex algorithms, such as detecting patterns in images and matching them with each other, decomposing audio or video tracks, or pairing time-stamped data with video tracks.
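
As an illustration of the text mining step in item 3, the following is a minimal Python sketch of TF-IDF weighting combined with cosine similarity for matching crawled text strings. It uses only the standard library; the sample offers and the function names are hypothetical, and stemming and n-grams are deliberately left out. It is meant as a sketch of the basic technique and not as the routines used in the projects described here.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace; a real pipeline would also stem."""
    return text.lower().split()


def tf_idf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Weight each term by its frequency in the document and its rarity in the corpus."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical crawled offer titles.
offers = [
    "cheap flights to rome in june",
    "rome flights june special offer",
    "car insurance with low monthly price",
]
vectors = tf_idf_vectors(offers)
print(cosine(vectors[0], vectors[1]))  # the two flight offers share weighted terms
print(cosine(vectors[0], vectors[2]))  # no shared terms, so the similarity is 0.0
```

Pairs with a similarity above a chosen threshold can be treated as candidate matches. Tuning that threshold and the tokenization is the kind of custom routine the text refers to.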

Although academic research teams have undoubtedly come into contact with more sophisticated forms of analytical routines, and intensive primary research is under way in many of the progressive branches, to support Big Data partnerships research teams should focus on mastering and building capacity in the 6 Big Data areas listed above.
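
Association analysis (rule generation), named above as the second most common task, can be sketched in a few lines of Python. The example computes support and confidence for single-item rules A -> B over a handful of hypothetical client baskets. The product names, thresholds and function names are assumptions made for illustration only; real projects would typically use an algorithm such as Apriori over far larger data.

```python
from itertools import permutations


def support(itemset: frozenset[str], transactions: list[set[str]]) -> float:
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def pair_rules(transactions: list[set[str]],
               min_support: float = 0.3,
               min_confidence: float = 0.7) -> list[tuple[str, str, float, float]]:
    """Return (antecedent, consequent, support, confidence) for rules A -> B."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in permutations(items, 2):
        both = support(frozenset({a, b}), transactions)
        if both < min_support:
            continue  # the pair is too rare to be interesting
        confidence = both / support(frozenset({a}), transactions)
        if confidence >= min_confidence:
            found.append((a, b, both, confidence))
    return found


# Hypothetical products held by five clients.
baskets = [
    {"loan", "insurance"},
    {"loan", "insurance", "card"},
    {"loan", "card"},
    {"insurance"},
    {"loan", "insurance"},
]
for antecedent, consequent, sup, conf in pair_rules(baskets):
    print(f"{antecedent} -> {consequent}: support={sup:.2f}, confidence={conf:.2f}")
```

A rule is kept only if the pair occurs often enough (support) and the consequent follows the antecedent reliably enough (confidence); both thresholds are business decisions rather than fixed constants.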


3.3 Ways of monetizing data analyses in the Big Data environment

Besides technological changes and specific areas of commercial benefit, Big Data projects also require new ways of monetizing the outputs of the analytical environment. The following section describes the basic principles of the 4 most common ways of monetizing Big Data outputs. Choosing a suitable monetization model is an important prerequisite [...]

(One-off) research task. If the subject of the analysis is a specific time slice of data, or if historical or otherwise time-stable data inputs are processed in bulk, it is appropriate to monetize the results as a research task. For most cooperating commercial entities, the acceptable price of a research-task project is derived from the alternative internal costs that would be needed to carry out the task. For its own calculation, academia can use the hourly (or man-day) rates common in the IT sector that supports the given end user of the data outputs.

Recurring data service. If the goal of the project is to map dynamic data, their changes over time, or changes in the state of objects or related texts, it is not appropriate to monetize the outputs through one-off research tasks; instead, the project should be set up as the delivery of a continuous service (data collection and analysis). The price of a data service, however, needs to be set relative to the business benefits the service is expected to bring over time (e.g., the volume of additional revenue or saved costs). The profit margin of the given industry 4 should also be taken into account. The service is usually charged as monthly, quarterly or annual usage fees. Responsible pricing must reflect the fact that the continuous nature of the service will require servicing (monitoring the availability of the service for the end client and correcting the algorithms over time).

Data product. If the data also pass through higher levels of processing (for example, analysis of the statistical distribution of the collected quantities), the outputs of the analysis can also be delivered as a data product. Unlike a data service, with a data product the research team does not give the commercial entity access to the data but sells a complete data set (a "data cube") covering all possible values (e.g., the values of all blood pressures in a population). A data product is a suitable form of monetization if the end user of the data product is expected to integrate the data cube into its own systems or to resell it in another form to its own clients. A data product is usually priced at the equivalent of several years' use of a data service that would provide the same data. A suitable period for the calculation is the period over which the computed data are expected to remain acceptably up to date.


4 Data published by the statistical office or on the FinStat portal, see: https://www.finstat.sk/analyzy/financne-ukazovatele-slovenskych-firiem (viewed on 26.9.2016)
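
The pricing rules of thumb described for the research task and the data product amount to simple arithmetic, sketched below in Python. All figures (man-days, rates, fees, years) and both function names are hypothetical and serve only to show the calculation; they are not prices reported by the author.

```python
def research_task_price(man_days: float, day_rate: float) -> float:
    """One-off research task: effort priced at a man-day rate."""
    return man_days * day_rate


def data_product_price(annual_service_fee: float, years_up_to_date: float) -> float:
    """Data product: the equivalent of several years of a comparable data service."""
    return annual_service_fee * years_up_to_date


# Hypothetical figures, chosen only to illustrate the arithmetic.
task = research_task_price(man_days=30, day_rate=400.0)
product = data_product_price(annual_service_fee=12 * 1500.0, years_up_to_date=3)
print(task, product)
```

The number of years is the period over which the delivered data are expected to stay acceptably current, so a faster-changing data set justifies a shorter multiple.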

Algorithm license. Instead of selling the data outputs themselves, the research team may decide to monetize the rights to use the Big Data algorithm. This way of monetization, so far less common in our region, can bring the research team both interesting financial resources and publicity in the expert community. It is important to note, however, that for an algorithm to be sold as an intangible asset (intellectual property), it must be protected by a patent or by license agreements, which significantly increase the cost of starting the cooperation. In addition, this way of monetization can only be used with partners that are able to run and maintain the licensed algorithm in their own infrastructure and to depreciate the cost as a long-term intangible asset (as a rule, within 5 years of depreciation). These specifics significantly limit the list of potential partners for this kind of monetization.

In general, most Big Data partnerships rely on monetization either as a data service or as a data product. Monetization as a research task is a universal form, usable for almost any Big Data project, but it is the least advantageous form of monetization for the party carrying out the data collection and analysis. It should therefore not be the first choice for partnerships.

3.4 Approaching a suitable commercial partner for a Big Data project

Having covered the business benefits and the most frequently requested components, it
remains to consider how research teams should try to establish partnerships with potential
customers from the commercial sector. In the Czech and Slovak Big Data space, the following
two approaches to the search can be recommended:

- Project with the help of an IT supplier. In most projects the outputs of the Big Data
functionality still have to be embedded in a presentation layer (for example a BI portal)
or integrated with one of the end customer's existing applications. A research team rarely
has resources that could also work systematically on the end customer's premises, so
(except for the cases described in the next item) the most suitable form of cooperation is
probably to deliver the Big Data project as a subcontractor of one of the end customer's
IT suppliers. A further advantage of this approach is that IT suppliers usually have
well-developed tools for marketing IT services beyond the pilot end customers.

- Direct project with the end customer. When certain conditions are met, the research team
can nevertheless work directly with the end customer. Suitable partners for direct
cooperation are companies whose core business revolves around IT products or services
(operators of portals and applications, or IT suppliers), or for which a significant part
of the business takes place online (e-shops, aggregators and booking portals, ...). Such
companies can be expected to understand IT trends well enough to act as a partner in the
expert specification of the tools and goals of Big Data projects. In other cases it is
advisable to use an IT company as an intermediary.

In terms of the general characteristics of companies suitable for Big Data projects, the
following factors increase the likelihood of interest in Big Data services:

1. A large number of end clients, suppliers or partners

2. Strong competition in the industry

3. Multi-channel service of the end client or multi-channel product sales

4. A higher frequency of end-client transactions (purchases)

4 Real Big Data projects carried out in the Czech Republic and Slovakia

The principles and recommendations for research-commercial partnerships in Big Data that
are documented in the other chapters of this paper emerged gradually, as a distillation of
recurring principles and assumptions from projects actually implemented in the commercial
sphere. Although it is primarily these principles that matter when setting up cooperation
with the commercial sphere, seeing them embedded in concrete examples gives a better
understanding of the attitudes of commercial practice. A detailed description of the
individual projects is beyond the scope of this paper. The author therefore relies on a
condensed description of the basic goals and research methods. Each project closes with the
subset of the identified principles of cooperation with researchers that was especially
important in that project.

4.1 Case study: "Price monitoring of competitors"

Fig. 1. Scheme of querying and analysis (diagram not reproduced; surviving label:
competitor's e-shop)

Industry: Online sale of cosmetics and dietary supplements

Project goals: Monitor and report daily on the behaviour of competing portals selling a
similar assortment. Identify the overlap of their product range with the client's own
e-shop. Report any change in the pricing policy of any of the portals, as well as the
listing of new products by competitors.

Techniques used: Automated web crawling, text mining, text string matching tests, time
series analysis

Most difficult elements: Matching product names as text strings (most portals did not use
any unique product IDs). Exact matching was rather complicated because the same products
are manufactured in a large number of similar variants. In addition, the same product is
sold in identical packaging with a different active substance or a different package
volume. Standard text matching algorithms therefore show a high rate of false positive
matches.
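
The false positives described above can be illustrated with a minimal sketch. It is not
the matching method used in the project, which the text does not describe; it only shows
one plausible guard: compare the quantities (volume, dosage, piece count) exactly and apply
fuzzy text similarity only to the rest of the name. The unit list, the 0.85 threshold and
the function names are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# A quantity such as "200 ml", "50mg" or "1,5 l". The unit list is illustrative only.
QUANTITY = re.compile(r"(\d+(?:[.,]\d+)?)\s*(ml|mg|kg|g|l|ks)\b", re.IGNORECASE)


def split_name(name):
    """Split a product name into (normalized text, set of (value, unit) quantities)."""
    quantities = {
        (value.replace(",", "."), unit.lower())
        for value, unit in QUANTITY.findall(name)
    }
    # Remove the quantities, lowercase, and keep only word characters.
    text = QUANTITY.sub(" ", name).lower()
    text = " ".join(re.findall(r"\w+", text))
    return text, quantities


def same_product(a, b, threshold=0.85):
    """Treat two names as the same product only if all quantities agree exactly
    and the remaining text is similar enough (hypothetical threshold)."""
    text_a, qty_a = split_name(a)
    text_b, qty_b = split_name(b)
    if qty_a != qty_b:
        # Same name with a different dosage or volume is a different product.
        return False
    return SequenceMatcher(None, text_a, text_b).ratio() >= threshold


print(same_product("Vitamin C 500 mg, 100 ks", "vitamin c 500mg 100 ks"))      # True
print(same_product("Vitamin C 500 mg, 100 ks", "Vitamin C 1000 mg, 100 ks"))   # False
```

A plain similarity ratio over the whole string would rate the second pair as a near-perfect
match, which is exactly the kind of false positive the case study reports.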

Principles: Value for the client = competitor monitoring, price intelligence. Form of
monetization = continuous data service, regular (monthly) fee for using the service.
Additional complications = parallelism to mask the frequency of queries to competitors'
portals

4.2 Case study: "Monitoring of clients switching insurers"

Fig. 2. Scheme of querying for monitoring clients switching insurers (diagram not
reproduced; surviving label: own client database)

Industry: Commercial property insurance - motor vehicle insurance

Project goals: Once a year (on the anniversary of the contract) owners of motor vehicles
may terminate the compulsory motor third-party liability insurance contract (Povinne
zmluvne poistenie, PZP) for their vehicle. Because insuring a motor vehicle is required by
law, every client who terminates a contract must, under threat of a fine from the
authorities, immediately take out insurance with another insurer. Every lost PZP client is
therefore automatically a client of another insurer. The portal www.skp.sk offers a service
in which a participant in a traffic accident can check, by the licence plate number (SPZ)
of the other car, with which insurer that car is insured.

By systematically querying the licence plate numbers of clients who have left, it is
possible to find out which insurer lured them away. By systematically querying the whole
market, it is possible to obtain market intelligence on how competitors' portfolios have
changed, or from which competing insurer we managed to win clients. By comparing price
levels with client movements, it is possible to test clients' price elasticity when
changing insurer.
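
The aggregation step behind this kind of churn intelligence can be sketched as follows. The
sketch assumes that some lookup from licence plate to current insurer is available; here
it is passed in as a function, and both the function and the sample data are hypothetical.
How the project actually queried the portal is not described in the text.

```python
from collections import Counter


def churn_destinations(lost_plates, lookup_insurer, own_insurer):
    """Count which insurers now cover the vehicles of clients who left.

    lookup_insurer is a caller-supplied function mapping a licence plate to the
    name of the current insurer, or None when the vehicle is not found.
    """
    flows = Counter()
    for plate in lost_plates:
        insurer = lookup_insurer(plate)
        if insurer is None or insurer == own_insurer:
            # Unknown vehicle, or the client is still (or again) insured with us.
            continue
        flows[insurer] += 1
    return flows


# Made-up lookup table standing in for the real query.
registry = {"BA123AB": "Insurer B", "KE456CD": "Insurer C", "ZA789EF": "Insurer B"}
flows = churn_destinations(
    ["BA123AB", "KE456CD", "ZA789EF", "XX000XX"], registry.get, "Insurer A"
)
print(flows.most_common())  # [('Insurer B', 2), ('Insurer C', 1)]
```

Running the same count over two consecutive periods, together with the price lists of the
insurers involved, gives the raw material for the elasticity comparison mentioned above.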

Techniques used: Automated web crawling, text mining, an algorithm for efficient querying
of the whole vehicle market.

Most difficult elements: Text mining of detailed vehicle descriptions and matching them to
competitors' price lists. Extraction of competing insurers' price lists (a large number of
tariff combinations).

Principles: Value for the client = competitor monitoring.

Form of monetization = analytical research task


4.3 Case study: "Financial intelligence on insurance payers"

Fig. 3. Scheme of querying for financial intelligence on insurance payers

Industry: Personal insurance paid for employees by the employer

Project goals: For insurance whose premiums are paid by legal entities, it is crucial to
monitor the financial health of the premium payers. Insurance premiums are often perceived
as one of the first cost items that companies in financial distress decide to give up. The
goal of the project was to identify legal entities that are likely to get into financial
difficulties. Public registers of court decisions and registers of financial statements
were used as source data.

Techniques used: Automated web crawling, text mining, semantic analysis

Most difficult elements: Extracting the context of court decisions in order to understand
the severity of a decision and its impact on the finances of the entity concerned.

Principles: Value for the client = profiling of the client's own customers. Form of
monetization = continuous data service, regular (annual) fee for using the service

5 Conclusion

Several market factors have put the scientific disciplines that touch on Big Data in a
favourable position for establishing partnerships with the commercial sector on the
development of Big Data projects. To really exploit this potential, it is necessary to know
thoroughly the main streams of commercial demand for these services, as well as the set of
tools that enable the successful delivery of the planned Big Data projects. Besides naming
these principles, the paper also outlines the most common models for monetizing Big Data
outputs and gives recommendations for finding suitable partners in the commercial sector.
All of these elements are then illustrated in a series of case studies carried out in the
Czech and Slovak business environment.

By applying the principles presented here, research institutions have a unique chance to
raise the level of integration of academic research with real business practice, which is
part of the vision of both the academic and the political leadership. The key factors that
make cooperation with the commercial sector in Big Data attractive are likely to remain
valid for 3-5 years, after which Big Data services can be expected to be adopted on a mass
scale by the respective industries themselves.
Annotation:

Real needs of business entities in mass collecting and processing of external data - use
cases of real CEE projects

The use of external data, often of a Big Data nature, has matured in the CEE region from
theoretical concepts to the first successful implementation projects. The author has taken
part in several projects aimed at mass crawling of external data for the business sector.
The article summarizes the most common information needs of CEE businesses and the
principles that need to be upheld in forging successful Big Data partnerships between
academic research bodies and business entities.

The article illustrates the technological and data requirements through a set of real
projects carried out in the CEE region in 2014-2016. Separate attention is given to
competition monitoring and collaborative filtering, recently the most common use cases of
commercial Big Data projects. In the final part the author shows how to monetize Big Data
services and products and indicates areas where universities can accelerate the wave of Big
Data implementation.
Abstract. Interactive data visualization is an important part of analytical processes, in
which it successfully combines the computing power of machines with the intelligence and
experience of people. Radical changes (Big Data, real-time data, security policy, social
networks, etc.) nevertheless pose new challenges even for proven approaches, and we must be
able to respond to them. We are therefore witnessing both evolutionary and revolutionary
changes in the field of visualization. The paper presents these current trends, names the
new challenges in interactive visual data analysis and outlines attractive directions for
future research.

Talk type: Invited talk

Keywords: data visualization, interaction, heterogeneous data


1 Information in the opposite direction

We are witnessing radical changes in the world of information. The information superhighway
that appeared at the turn of the millennium with the arrival of internet technologies was
at first, to use a traffic analogy, only a one-way street. It distributed data stored in
digital counterparts of analogue libraries to readers. This was a revolution in access to
information comparable to the invention of the printing press, but the real paradigm shift
has come only now, with the opening of the opposite direction on the information
superhighway.

Big Data, social networks, data journalism, GPS, the Internet of Things... these are just a
few of the terms that have changed how we access data, how we process and analyse it, and
how we derive knowledge about the world from it.

The data we work with today is growing in volume thanks to technologies for collecting and
storing it. Integrating data from different sources creates a heterogeneous environment
characterized by a large volume of data, its great variability and the growing importance
of metadata. At the same time, the question arises of the reliability of data that
originates outside a controlled laboratory environment, in the wilderness of the digital
world.
2 Hot data ground

It might seem that changes in the data space are of no fundamental importance for data
visualization. After all, data is just an input for visualization, and algorithms should
work reliably on all admissible inputs.

Fig. 1. Visualization pipeline [2]

In reality, however, these revolutionary changes have a significant impact on all stages of
the visualization process (Fig. 1).

The huge volume of processed data slows down user-driven data filtering and rendering,
which hurts above all the interaction with the human.

Integrating data from heterogeneous sources (e.g. combining numerical, textual, relational,
temporal and geographical data) poses a great challenge for the mapping stage, that is, for
translating data into its graphical form: positions, shapes, colours...

The need to take into account new data attributes, such as probability or trustworthiness,
creates new requirements for mapping and rendering [5].

And we must not forget the most important element of the whole process: the human, who
stands at the end of the visualization pipeline as the recipient of the visual output and
who can intervene in all stages of the process through interaction. The cognitive and
perceptual limitations of the human organism remain the same regardless of technological
progress.

3 Innovations for visualization

In the information visualization domain most of these challenges are already known, and the
individual problems have been studied for some time. For example, displaying heterogeneous
data using linked views [4], rendering that takes probability into account [1] and large
data [6] are topics that visualization research has addressed for a long time. But only the
growing number of practical applications, the growth in the volume of real data and the
simultaneous occurrence of several previously separate problems in a single task will put
the proposed approaches to the test under current conditions.

It will be interesting to follow new research in several areas:

3.1 Collaborative visual analysis

Involving several analysts or experts from different fields is a promising solution to the
problems caused by the growing volume and variability of the analysed data. Sharing
knowledge among the participants of a collaborative analysis is an entirely new setting for
the use of visualization.
3.2 Metadata

Information about the reliability of the data source, its trustworthiness, the date of
acquisition and many other metadata are important when assessing data, and the demand for
reflecting metadata in visualization will grow.

3.3 Focus+Context

Dividing data into focus and context is a well-known and well-proven technique for the
clear display of large or complex data [8]. It will be interesting to follow its
application to heterogeneous data, where the focus may differ from the context not only in
the level of detail but also in the data source or in an entirely different type of data
mapping.

3.4 Use of artificial intelligence

The placement of graph vertices, the ordering of axes in a diagram, the choice of mapped
data attributes and other visualization parameters have many possible configurations [9].
Artificial intelligence algorithms can learn from human users which configurations are
better than others and thus make the analysts' further work easier.

4 What is it all for?

These and many other challenges and innovations will accompany the field of information
visualization in the period following the recent change of the data paradigm. Data comes
from uncontrollable sources, in unprecedented volume, and often requires a quick response.
From the data and knowledge side we thus face several fundamental challenges, which at the
same time motivate further research and development in information visualization:

1. Technical challenge: Growth in the volume of processed data, integration of data from
different sources

2. Technological challenge: The need for collaboration and integration in the analytical
process

3. Political challenge: Dealing with disinformation, and data-driven journalism.

The growth in the volume of processed data is being handled technically through
improvements in hardware and data mining algorithms. But, as John Stasko put it, use data
mining when you know the question, and use visualization when you do not know the
question [7].

The collaborative and integration process takes place in communication between the various
experts and analysts working on a problem. So far nobody has invented a more effective
means of complex information exchange between people than the image.

The rise of propaganda in the media must be countered with high-quality data-driven
journalism. And that cannot do without infographics and data visualization.
5 Conclusion

Information visualization remains an important tool for discovering the unsuspected,
verifying the suspected and communicating the verified. Developments in acquiring,
processing and publishing data inevitably create new challenges for visualization as well.
In this paper we presented some of these challenges along with several attractive research
directions that respond to them.
Annotation:

Information visualization - current challenges and trends

The increase in data sources and data volume introduces new challenges for information
visualization: integration of heterogeneous data sources, large data, uncertainty in data,
real-time streaming data, etc. While the visualization domain is already familiar with many
potential solutions, e.g. linked views, focus+context visualization, or density-based
visualization, the abrupt changes in the data domain will put these techniques to a real
test. The need for some new techniques also arises, such as the integration of artificial
intelligence into visual analytics and collaborative analysis among multiple human experts.
Abstract. Traditional data mining algorithms typically assume data instances to be
independent. However, there are many real-world scenarios where relationships between data
instances exist and are essential for understanding the data. For example, there are
relationships between people in social networks, between chemical elements in chemical
compounds, etc. It is difficult or even impossible to express such information in the
classical attribute-value representation. Graph mining is an area of data mining that uses
a graph representation of data and allows us to exploit the relationships in the data. The
goal of this talk is to present diverse successful applications of graph mining on
real-world graphs.

Talk type: Invited talk 

Keywords: graph mining, network analysis, data mining, classification, anom¬ 
1 Introduction

Graph mining is an area of data mining in which data is presented as a graph or a set of
graphs [2]. Compared to the regular attribute-value representation, graphs allow us to
model dependencies between individual entities. These graphs can be either static or
dynamic, i.e. they change over time. Nodes and edges can also have attributes assigned. In
this paper, we present several applications of graph mining on real-world graphs.


2 Classification of Nodes 

The first application we present is node classification in graphs. It is assumed that some
nodes have a class assigned, and the task is to train a model that predicts the class of
the other nodes. These models typically utilize the network structure and classify a node
by using its neighbours or the nodes that are similar to it with regard to a defined
measure. The structural neighbourhood-based classifier (SNBC) algorithm [3] generalizes
this basic scenario by allowing multiple classes for each node. Using a random-walk
technique, the algorithm was able to classify scientific publications on the basis of a
citation network, categorize books on the basis of a co-purchasing network, etc.
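
The basic scenario (one class per node, prediction from direct neighbours) can be sketched
as a majority vote. This is a deliberately simple illustration and not the SNBC algorithm,
which uses random walks and supports multiple classes per node; the graph representation
and the names below are assumptions.

```python
from collections import Counter


def classify_by_neighbours(graph, labels, node):
    """Predict the class of `node` as the most common class among its labelled
    neighbours. `graph` maps a node to an iterable of neighbours, `labels` maps
    already classified nodes to their class. Returns None without any evidence."""
    votes = Counter(labels[n] for n in graph.get(node, ()) if n in labels)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy citation network: paper p4 cites three labelled papers.
graph = {"p4": ["p1", "p2", "p3"], "p5": ["p4"]}
labels = {"p1": "databases", "p2": "databases", "p3": "machine learning"}
print(classify_by_neighbours(graph, labels, "p4"))  # databases
```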

3 Anomaly Detection in Recommendation Networks 

The next application is treated as a classification task as well, but now the edges are
classified [4]. More precisely, the goal is to decide which edges are anomalous and which
are not, in order to improve the performance of a geospatial recommender, the Google
Related Places Graph. This recommender system recommends similar or nearby places
(businesses, sights, etc.) for a given place searched via the Google Search Engine 1 . It
uses a network in which similar places are linked together. By detecting and removing
anomalous edges, the authors were able to filter out many irrelevant recommendations.
Anomaly detection is carried out by a Random Forests classifier [1] that uses various
structural features extracted from the network as well as features from the Google
Knowledge Graph [6].
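
The text does not list the structural features that were used. As an illustration only,
the sketch below computes a few features that are commonly derived for an edge (common
neighbours, Jaccard coefficient of the two neighbourhoods, endpoint degrees); such a
feature dictionary could then be fed to any classifier. The feature choice and the names
are assumptions, not the authors' feature set.

```python
def edge_features(adj, u, v):
    """Structural features of the edge (u, v) in an undirected graph.

    `adj` maps every node to the set of its neighbours.
    """
    neighbours_u = adj[u] - {v}
    neighbours_v = adj[v] - {u}
    common = neighbours_u & neighbours_v
    union = neighbours_u | neighbours_v
    return {
        "common_neighbours": len(common),
        # Jaccard coefficient: shared neighbours relative to all neighbours.
        "jaccard": len(common) / len(union) if union else 0.0,
        "degree_u": len(adj[u]),
        "degree_v": len(adj[v]),
    }


adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a", "e"},
    "e": {"d"},
}
print(edge_features(adj, "a", "b"))  # well-embedded edge, jaccard 0.5
print(edge_features(adj, "a", "d"))  # edge with no shared neighbours, jaccard 0.0
```

An edge whose endpoints share no neighbours is not necessarily anomalous, but features of
this kind give a classifier a way to separate well-embedded links from isolated ones.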

4 Anomaly Detection in Communication Networks 

Detection of anomalous patterns in dynamic networks is presented in [7]. Patterns are
represented by subgraphs that change into other subgraphs in the next moment. By a change,
we mean addition or deletion of vertices or edges, change of labels, or a combination of
these elemental changes. Thus, the patterns express the changes on the local level of the
network.

An example of two patterns from an email-correspondence network is shown in Fig. 1. The
nodes represent employees (Emp = regular employee, VP = vice president) and the links
represent sent emails. The left part of each pattern depicts the communication on one day
and the right part the communication on the next day. Frequently occurring patterns are
marked as normal, whereas deviations from these patterns are marked as anomalies. More
specifically, the deviations occur only in the right part of the pattern. The normal
patterns capture the common evolution of the subgraphs and they also serve as an
explanation of the anomalous deviation. Besides the analysis of email communication, the
method was also used to analyse the graphs of resolution proofs created by computer-science
students.


1 http://www.google.com.
Fig. 1. Anomalous communication pattern and its explanation (panels: anomaly pattern,
explanation pattern).


5 Community Detection in Voice-Call Networks 

The last application of graph mining is concerned with community detection in a network
built from voice calls. Community detection is a process of node clustering in which nodes
clustered together are densely connected in the original graph. This work was created
during a project at Gauss Algorithmic s.r.o. for a telecommunication company. The nodes in
this network represented phone numbers and the edges represented calls between the numbers.
More precisely, the edges were obtained by aggregating calls over a longer period of time
and were weighted by call duration statistics. The Label Propagation Algorithm [5],
modified for weighted graphs, was used for community detection. Subgraphs formed by the
discovered communities came in various sizes and shapes. The resulting communities are
going to be used for improving customer experience and for churn prediction.
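
A weighted variant of label propagation can be sketched as follows. It is a minimal,
deterministic illustration and not the implementation used in the project: nodes are
visited in sorted order and ties are broken by the smallest label so that the result is
reproducible, whereas the usual formulation visits nodes in random order and breaks ties
randomly. The edge weights stand in for the call duration statistics mentioned above.

```python
from collections import defaultdict


def label_propagation(weighted_edges, max_rounds=100):
    """Weighted label propagation on an undirected graph.

    `weighted_edges` is an iterable of (u, v, weight). Every node starts in its
    own community and repeatedly adopts the label with the largest total edge
    weight among its neighbours until no label changes.
    """
    adj = defaultdict(dict)
    for u, v, w in weighted_edges:
        adj[u][v] = adj[u].get(v, 0.0) + w
        adj[v][u] = adj[v].get(u, 0.0) + w

    labels = {node: node for node in adj}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adj):
            score = defaultdict(float)
            for neighbour, weight in adj[node].items():
                score[labels[neighbour]] += weight
            best = max(score.values())
            candidates = sorted(l for l, s in score.items() if s == best)
            if labels[node] in candidates:
                continue  # the current label is already among the strongest
            labels[node] = candidates[0]
            changed = True
        if not changed:
            break
    return labels


# Two tightly connected groups of numbers joined by a single short call.
calls = [
    ("a", "b", 10), ("b", "c", 10), ("a", "c", 10),
    ("d", "e", 10), ("e", "f", 10), ("d", "f", 10),
    ("c", "d", 1),
]
communities = label_propagation(calls)
print(communities)
```

On this toy input the nodes a, b, c end up with one label and d, e, f with another, because
the weak link between c and d is outweighed by the strong links inside each group.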

6 Summary 

In this work we presented four different applications of graph mining on real-world 
datasets. This is merely a tiny fraction of graph mining. There are many other tasks in 
this area, such as frequent pattern mining, graph modelling, graph summarization, or 
link prediction. 

Acknowledgements: We would like to thank the organizers for the invitation, Lubos 
Popelinsky and Gauss Algorithmic s.r.o. for support. 
Abstract. The topic of open data in Slovakia meets with mixed reactions and results, but in
general it is making progress, and from a topic that was marginal or even unknown it has
within a few years become a commonly accepted one. As far as actual publishing is
concerned, Slovakia still has a lot of work ahead, but we have already seen some quite
unique and positive results that form a good basis for further progress.

The talk therefore summarizes the history of open data in Slovakia over roughly the last
five years (with emphasis on the last year), the current situation, and an estimate of what
might happen in the near future.

Talk type: Invited talk
1 Introduction

The topic of open data in Slovakia meets with mixed reactions and results, but in general
it is making progress: after Slovakia joined the OGP and set itself tasks in its first
Action Plan, mainly on the topic of open data, a topic that had been marginal or even
unknown became within a few years a commonly accepted one. As far as actual publishing is
concerned, however, Slovakia still has a lot of work ahead (mainly in publishing the
so-called priority datasets and in complying with established standards), but we have
already seen some quite unique and positive results that form a good basis for further
progress.

The talk therefore summarizes the history of open data in Slovakia over roughly the last
five years (with emphasis on the last year), the current situation, and an estimate of what
might happen in the near future.

The content of the talk relies mainly on the authors' memory and is not a complete or
exhaustive list of news in the field of open data. The order of the sections is roughly
chronological, reflecting, however, not just the formal start of the activities but the
time when the most visible changes were observed.
2 Roughly the last five years

2.1 Formation of the OpenData.sk community

The OpenData.sk community was formed in 2010 as an informal initiative under the umbrella
of the civic association (OZ) Utopia. The initiative built on earlier activities of other
NGOs: SOIT, Aliancia Fair-Play and Transparency International.


2.2 Open Government Partnership (OGP)

In September 2011 Slovakia joined the Open Government Partnership (OGP; in Slovak,
Iniciativa pre otvorene vladnutie) with the signature of the then Prime Minister Iveta
Radicova.

After Slovakia joined the OGP, the Action Plan for 2012-2013 1 was adopted (by Government
Resolution No. 50/2012 2 ), in which Slovakia set itself 22 tasks, 12 of which concerned
access to information (and thus also open data). Building on this action plan and on other
existing laws (the Act on Free Access to Information, 211/2000 Z.z., and the Act on Public
Administration Information Systems, 275/2006 Z.z.), the state administration formally
started its open data activities in Slovakia. In 2012, for example, the data catalogue
http://data.gov.sk was launched, and the first 161 datasets were later published and
documented through it (as of August 2013 3 ). Details on the implementation of the plan can
be found, for example, in the document "Nezavisly hodnotiaci mechanizmus: Slovensko:
Hodnotiaca sprava 2012-2013" 4 .

The activities continued with the adoption and subsequent implementation of the second
plan, for the year 2015 5 . One important result of this plan is, for example, the survey
on priority datasets 6 , on the basis of which the following datasets were identified as
priorities:

1. Real estate cadastre (UGKK)

2. Election results (Statistical Office)

3. Data from the census of population, houses and dwellings (Statistical Office)

4. Business register (Ministry of Justice)

5. Register of addresses (Ministry of Interior)

6. Trade licence register (Ministry of Interior)

7. Data on traffic accidents (Ministry of Interior, PPZ)

8. Data on crime (Ministry of Interior, PPZ)

9. Timetables (Ministry of Transport, Construction and Regional Development)


1 http://www.otvorenavlada.gov.sk/fmalna-verzia-akcneho-planu/ 

2 http://www.rokovania.sk/File.aspx/ViewDocumentHtml/Uznesenie-12358?listName=Uznesenie&prefixFile=u

3 http://www.otvorenavlada.gov.sk/hodnotenie-iniciativy-pre-otvorene-vladnutie/ 

4 http://www.opengovpartnership.org/sites/default/files/Slovakia_fmal_2012_0.pdf 

5 http://www.otvorenavlada.gov.sk/akcny-plan-na-rok-2015/ 

6 https://github.com/otvorenavlada/akcnyplan2015/tree/master/uloha-03 
10. Postal codes (Ministry of Transport, Construction and Regional Development,
Slovenska posta)

11. Current state and pollution of the environment (Ministry of Environment, SHMU)

2.3 Decree 55/2014

The amendment of the Decree on Standards for Public Administration Information Systems
(ISVS) [No. 55/2014 Z. z.] 7 , effective from 15 March 2014, introduces the new term "open
data" and defines basic standards for it: the CSV and JSON formats, the REST protocol,
requirements concerning licensing, quality, etc., as well as the obligation to register
datasets at data.gov.sk.

Since compliance with the Decree is imposed on public administration by Act No. 275/2006
Z.z., this amendment enables public administration to publish open information fully in
line with the legislation in force. The changes in the Decree also respond to shortcomings
identified during the implementation of the first OGP action plan.

2.4 Register of Financial Statements (Register UZ)

In 2014 the Ministry of Finance of the Slovak Republic launched a new version of the portal
"Register uctovnych zavierok" (Register of Financial Statements), which includes a public
API 8 . Through the portal and the API, citizens and companies can access accounting
information of Slovak organizations. According to the author of the presentation the
service is unique, because:

1. at the time of its launch there was apparently only one other similar solution in the
world, in the United Kingdom, and the Slovak solution provides much more detailed
information than the British one,

2. it was probably the first official Open Data API in Slovakia,

3. open data from this portal very quickly found use not only in the Slovak non-profit
sector (which is traditional) but also in the business sector (which is not so usual):
the portal FinStat.sk was created.

The Register UZ API also started an era of closer cooperation between the state
administration and the Open Data community (see the presentations on the API and FinStat.sk
at the community meetup 9 ).


7 http://www.informatizacia.sk/standardy-is-vs/596s 

8 http://www.registeruz.sk/cruz-public/version/193084/static/api.html 

9 https://utopia.sk/wiki/display/opendata/OpenData.sk+Meetup+%233#OpenData.skMeetup%233-BlokoAPIRegistraUZ
3 The last year

3.1 DanubeHack

On 15-17 October 2015 the Danube Open (Geo) Data Hackathon & Developers' Workshops, known
for short as DanubeHack 10 , took place in Bratislava. It was the first hackathon in
Slovakia that was devoted to open data and at the same time co-organized by the state
administration, namely by the Slovak Environment Agency (SAZP) and the National Agency for
Network and Electronic Services (NASES), which also prepared new open data for the
hackathon (SAZP data from the GEO domain and NASES data from the Register of Addresses).

The hackathon had international participation; 9 projects competed and the winners received
prizes worth 3000€. More detailed information on the results can be found on the hackathon
website 11 .


3.2 New version of data.gov.sk

At the end of 2015 or the beginning of 2016 a new version of the data.gov.sk portal was put
into operation. Besides the visible changes (new graphics, a newer version of the CKAN
software, etc.), this upgrade, delivered under the eDemokracia/Modul Open Data (MOD)
project, brings a comprehensive Open Data infrastructure for both state and public
administration, integrated with the existing Central Public Administration Portal
(Ustredny portal verejnej spravy, UPVS, http://slovensko.sk). In the new version MOD also
includes tools for storing and publishing open data directly on the data.gov.sk portal,
transformation and visualization tools, and more complete support for the processes of
publishing open data 12 .
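
Because the portal is built on CKAN, its catalogue can in principle be searched through the
CKAN Action API. The sketch below builds a package_search request and extracts the CSV
resources from a response. It assumes that the portal exposes the standard CKAN endpoint
under /api/3/action/, which the text does not confirm; the helper names and the sample
response are hypothetical.

```python
import json
from urllib.parse import urlencode


def package_search_url(base_url, query, rows=10):
    """Build the URL of a CKAN Action API dataset search."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def csv_resources(response_text):
    """Return (dataset name, resource URL) pairs for all CSV resources
    found in the JSON body of a package_search response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    found = []
    for dataset in payload["result"]["results"]:
        for resource in dataset.get("resources", []):
            if (resource.get("format") or "").upper() == "CSV":
                found.append((dataset["name"], resource["url"]))
    return found


print(package_search_url("https://data.gov.sk/", "register adries", rows=5))

# Made-up response in the shape CKAN returns, used here instead of a live call.
sample = json.dumps({
    "success": True,
    "result": {"count": 1, "results": [{
        "name": "register-adries",
        "resources": [
            {"format": "csv", "url": "https://example.org/ra.csv"},
            {"format": "JSON", "url": "https://example.org/ra.json"},
        ],
    }]},
})
print(csv_resources(sample))
```

The response body would normally be fetched with any HTTP client; it is kept out of the
sketch so that the parsing logic can be checked without network access.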


3.3 Increased activity of the Statistical Office

In 2015 the Statistical Office (SU) significantly stepped up its activities in publishing
open data. Since providing information to the public is in fact one of the Office's main
tasks, the focus of its open data activities has been the conversion of data from its
existing systems into open formats in line with Decree 55/2014, i.e. automating the export
of DATAcube cubes to the CSV format, choosing the CC-BY-SA licence and registering the
datasets on the data.gov.sk portal 13 .
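
What exporting a data cube to CSV involves can be shown on a toy example: each cell of the
cube becomes one row, with one column per dimension and one column for the value. The
sketch is not the Statistical Office's export; the in-memory cube structure, the dimension
names and the numbers are made up for illustration.

```python
import csv
import io


def cube_to_csv(cube, dimensions, value_name="value"):
    """Flatten a cube into CSV text in long format.

    `cube` maps a tuple of dimension members (in the order given by
    `dimensions`) to a cell value.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(list(dimensions) + [value_name])
    for key in sorted(cube):
        writer.writerow(list(key) + [cube[key]])
    return buffer.getvalue()


# Hypothetical two-dimensional cube (region x year) with dummy values.
cube = {("Presov", 2015): 10, ("Levice", 2015): 20}
print(cube_to_csv(cube, ["region", "year"]))
```

The long format is convenient for open data because new dimension members add rows rather
than columns, so the header stays stable between exports.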

K 7.9.2016 ma na data.gov.sk SU zaevidovanych 606 datasetov z celkoveho poctu 
1002. Aktualne informacie mozno sledovat’ priamo na data.gov.sk 14 . 


10 http://www.danubehack.eu/ 

11 http://www.danubehack.eu/?#section-results 

12 https://www.nases.gov.sk/data/files/9218.pdf 

13 https://utopia.sk/wiki/display/opendata/OpenData.sk+Meetup+%236 

14 https://data.gov.sk/organization/f4787c6f-9fa3-406c-b8d5-d374f1e1f2d3 

3.4 Involvement of local government 

In 2015 the city of Prešov crowned more than four years of its Open Data activities by launching a catalogue of open data 15 and registering its datasets on data.gov.sk 16 . It then began sharing its knowledge and experience with other cities, and so, for example, at the beginning of this year (2016) the city of Levice launched its own open data catalogue and started publishing datasets 17 . 

Until 2015 open data was the domain of the state administration, while local government was, so to speak, unaware of the possibilities and obligations regarding open data. Since 2015 we can begin to speak of the active involvement of Slovak local governments in open data as well. 


3.5 Register of Addresses as Open Data 

At the beginning of 2016, the publication of data from the Register of Addresses (RA) as open data through the data.gov.sk portal 18 went into official operation. It is one of the first pilot projects of publishing data through the already mentioned Modul Open Data (MOD). Within it, the Ministry of Interior of the Slovak Republic, based on an agreement with NASES, provides RA data to data.gov.sk through an "internal" G2G API, while data.gov.sk itself performs the format and content conversions needed for the data to meet the open data requirements of Decree (Výnos) 55/2014. 

Publishing RA data is one of the cooperation models that NASES offers to other public administration organizations: these organizations provide the data in unchanged form and NASES ensures proper publication as open data in accordance with the applicable standards, which makes the publication of the data both correct and efficient. 

Note: There are several correct ways of publishing open data, and so NASES offers several models of cooperation in publishing. At one end of the scale, publication can be handled entirely by the so-called obliged persons (povinné osoby, PO), in which case NASES does not actively cooperate and provides only the minimal services needed for registering datasets on data.gov.sk (required by Decree 55/2014). At the opposite end of the scale is the model described above, in which the PO performs only minimal tasks and NASES does the vast majority of the work. 


15 https://utopia.sk/wiki/pages/viewpage.action?pageId=58360521 

16 https://data.gov.sk/organization/8a043e04-3ef8-45d4-a9a9-ede214e5fac5 

17 https://utopia.sk/wiki/pages/viewpage.action?pageId=61866089 

18 https://data.gov.sk/dataset?tags=register+adries 


3.6 Referenceable identifier and Linked Data in Decree 55/2014 

Since 2015 a process of updating Decree (Výnos) 55/2014 with regard to Linked Data and referenceable identifiers has been under way. It builds on the 2014 amendment, and the goal is to supplement the existing standard of so-called data elements in the XML format (Annex No. 2 to the Decree) with a representation in RDF, by introducing a Slovak ontology for the data elements and linking it to ontologies used worldwide. The motivation for a specific Slovak ontology is to make the transition from XML to RDF simple (e.g. using simple XSLT transformations). At the same time, the properties of Linked Data and the Semantic Web will be used so that Slovak data are linked and linkable to data abroad. 
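
The idea of a one-to-one transition from XML data elements to RDF can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: the namespace `https://example.org/def/`, the element names, and the subject URI are hypothetical and are not taken from the Decree or from the proposed ontology, and Python is used here in place of the XSLT mentioned above.

```python
import xml.etree.ElementTree as ET

# Hypothetical ontology namespace; the real Slovak ontology is not reproduced here.
ONTOLOGY = "https://example.org/def/"


def xml_to_ntriples(xml_text, subject_uri):
    """Map every child element <Name>value</Name> of the root to one
    N-Triples line: <subject> <ONTOLOGY + Name> "value" ."""
    root = ET.fromstring(xml_text)
    triples = []
    for child in root:
        value = (child.text or "").strip()
        if not value:
            continue  # skip empty elements
        # Escape backslashes and double quotes for an N-Triples literal.
        literal = value.replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f'<{subject_uri}> <{ONTOLOGY}{child.tag}> "{literal}" .')
    return triples


# Hypothetical data element, for illustration only.
example = (
    "<PhysicalAddress>"
    "<Municipality>Levice</Municipality>"
    "<PostalCode>93401</PostalCode>"
    "</PhysicalAddress>"
)
for line in xml_to_ntriples(example, "https://example.org/id/address/1"):
    print(line)
```

Because every XML element name maps directly to one RDF property, the same mapping could be expressed as a short XSLT stylesheet, which is the kind of simplicity the specific ontology aims for.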

The draft amendment is currently being discussed within working group PS1 of the Standardization Commission 19 and also in the wider community 20 . 

3.7 OGP Action Plan 2017-2019 

The Office of the Plenipotentiary of the Government for the Development of Civil Society (ÚSV ROS), in cooperation with citizens, prepared the next OGP action plan for the years 2016 to 2019 21 . The action plan defines 69 tasks, 14 of which fall into the categories of open data and open APIs. 

The tasks in the open data category represent continuity with the previous plans and can be seen as an incremental improvement of open data publication in Slovakia, with the emphasis shifting from technical issues to improving processes and quality. 

The tasks related to open APIs represent a revolutionary step forward that should unlock the hidden potential of the basic electronic services of public administration: companies, non-profit organizations, and individuals will also be able to access them and, as a result, offer citizens significantly extended services that help even in situations that do not fall fully within the remit of public institutions. 

An example is buying a car. Today a citizen can use the state's electronic services to handle the basic official steps online (registering the vehicle, etc.), but has to arrange various other obligations (e.g. compulsory third-party liability insurance) or additional matters (e.g. comprehensive insurance) elsewhere. In this case an extended service offered by a company can give the citizen everything in one place and much more simply (from vehicle registration to comprehensive insurance). The state cannot provide something like that (offering commercial comprehensive insurance is outside its remit), and thanks to the electronic identity card (eID) and the guaranteed electronic signature (ZEP) the solution can still be fully secure even when the intermediary is a private company. 


19 https://wiki.finance.gov.sk/label/PS1/ps1-19 

20 https://platforma.slovensko.digital/t/semanticke-datove-standardy-pre-udaje-verejnej-spravy-sr/185 

21 http://www.minv.sk/?ros_ogp_tvorba_2016-19&sprava=finalny-navrh-akcneho-planu-iniciativy-pre-otvorene-vladnutie-na-roky-2016-2019-zverejneny 


3.8 The Slovensko.Digital initiative 

At the end of 2015 Slovakia reached a milestone in the informatization of public administration: since 2007 more than 900 million € from EU funds and from Slovakia's own funds had been spent on various information systems, yet from the citizen's point of view only few beneficial solutions are visible. This balance brought together several hundred enthusiasts from the IT sector and from public administration, who formed the Slovensko.Digital platform. At the beginning of the year it was transformed into an official non-profit organization whose goal is to cooperate with public institutions on improving the quality of their digital services 22 . 

One way to achieve this is to show positive examples of how state IT can be done well. One such inspiration is Ekosystém.Slovensko.Digital 23 , which went into test operation on 18 August 2016 24 . Ekosystém shows that priority open data (the Register of Legal Entities, the Central Register of Contracts, and others) can be published simply, cheaply, and well (using CSV and BitTorrent), and that an API (based on REST and SQL) and further supplementary services can be provided for them just as simply, cheaply, and well. One such service is Autoform, which automatically fills in information about companies in forms on the basis of official open data. 

4 Current situation 

In 2016 a slowdown in open data matters can be observed at the official level, apparently caused first by the anticipation of the elections, later by the election results, and later still by Slovakia's presidency of the Council of the EU. 

For example, competences regarding public administration information systems (ISVS) are being transferred to the newly created Office of the Deputy Prime Minister of the Slovak Republic for Investments and Informatization, headed by Mr. Peter Pellegrini 25 26 . 

A new OGP action plan for the years 2016 to 2019 (mentioned above) is in preparation. 

There is also a draft "Strategy and Action Plan for Making Available and Using Open Data of Public Administration" prepared at NASES 27 . Among its basic goals it proposes to keep adhering to the basic principles of "publishing and making available all public administration data that are not classified or protected" and "publishing structured datasets". The concrete measures proposed include additions to the ISVS standards in the part on open data, establishing the role of a data curator, and an inter-ministerial support team for data publication. 

22 https://slovensko.digital/ 

23 https://ekosystem.slovensko.digital/ 

24 https://platforma.slovensko.digital/t/ekosystem-slovensko-digital-novinky/2326 

25 https://www.vicepremier.gov.sk/index.php/informatizacia/index.html 

26 https://www.vicepremier.gov.sk/index.php/o-urade/organizacna-struktura/index.html 

27 https://www.nases.gov.sk/data/files/20160122_OpenData_vpk.pdf 


5 Outlook 

As citizens and activists, we can offer only an expert estimate regarding the publication of open data by public institutions. We estimate that the newly created Office of the Deputy Prime Minister of the Slovak Republic for Investments and Informatization, in cooperation with citizens (Slovensko.Digital and others), will choose open data as one of its logical priorities in the coming period of developing public administration information systems, and that progress will be made on important issues: 

• publication of priority datasets based on public consultations (a task arising from the draft OGP action plan) 

• improvement of the standards for publishing datasets (Decree 55/2014) 

• enforcement of compliance with the standards in existing and especially in new information systems 

• activities aimed at a real economic valorization of the published open data (savings or new added value), since this is one of the motivations of public administration ("Contribution to the development of the state's economy is one of the main motives of the EU for making Open Data available." 28 ) as well as a motivation of the community and of companies. 

6 Conclusion 

The topic of open data already has a history in Slovakia, along with interesting results. The currently subdued activity of public administration is balanced by new activities of citizens, and thanks to the results achieved so far and to the new activities we can expect further positive results in the future. 

Acknowledgement: We thank all who have helped with the publication of open data in Slovakia in the past, whether within civic initiatives or "from inside" public administration. 


Annotation: 

Current Open Data activities in Slovak Republic 

The Open Data theme has been received with mixed reactions and has produced varying results. In general, however, work related to Open Data is progressing, and the topic, which was once on the fringes of interest, has become commonly accepted within a few years. In terms of actual publication of Open Data, Slovakia still has a long road ahead, but noteworthy, even unique, results have already been achieved. Those achievements form a good base for further progress in the future. 

The presentation contains a summary of the history of Open Data in Slovakia over roughly the past 5 years (with the main focus on activities in the past year), a view of the current situation, and an estimate of what might happen in the near future. 


28 https://www.nases.gov.sk/data/files/20160122_OpenData_vpk.pdf 

Abstract. The lecture will summarize the main results achieved in the area of open and linked data in the Czech Republic in recent years. It will then present the plans for further development of open linkable data in the Czech Republic. 

1 Introduction 

Since 6 September 2016, when the President of the Czech Republic signed the amendment to Act No. 106/1999 Coll., on Free Access to Information, open data are understood as [7] "information published in a manner allowing remote access in an open and machine-readable format, whose manner and purpose of subsequent use is not restricted, and which is registered in the national catalogue of open data". 

After two years of intensive effort by public administration, especially the Ministry of the Interior, and by non-profit organizations, open data have thus become part of the legal environment in the Czech Republic as well, and can be used as one of the important innovative tools in the functioning of public administration. The importance of open data in the Czech Republic is also reflected in the fact that open data appear among the goals of important strategic documents of the Czech Republic: 

• Strategic Framework for the Development of Public Administration of the Czech Republic for the period 2014-2020 [9], specifically its Specific Objective 3.1 - Completing a functional eGovernment framework. 

• Action Plan of the Czech Republic for the Open Government Partnership for 2016 to 2018 [2], part 4.2 - Making data and information available. 

• Strategy for the Development of ICT Services of Public Administration and its measures for making ICT services more efficient, approved by Government Resolution No. 889/2015 [10], and its strategic objective No. 5 - From isolated data to linked and open public administration data and to qualified decisions leading to higher efficiency of public administration services. 

• Action Plan for the Fight against Corruption for 2016 [1] and its objective No. 2 - Transparency and open access to information. 

• State policy in electronic communications "Digitální Česko v. 2.0, Cesta k digitální ekonomice" (Digital Czech Republic v. 2.0, The Way to the Digital Economy) [8] and its objective 5.4 - Use of public sector information. 

• Action Plan for the Development of the Digital Market [3] and its objective Access to public sector data in Chapter 5: New trends. 

Open data have thus become one of the areas in which activities have accelerated considerably over the past two years, and awareness of the usefulness and necessity of using open data has grown markedly both at the state level and at the level of self-governing units. 

2 Activities carried out in past years 

The idea of open data began to gain ground in the Czech Republic in 2012, when the Czech government approved the Open Government Partnership Action Plan, in which it committed, among other things, to publishing selected public sector information as open data. The government subsequently began to address this topic, and the Concept of Cataloguing Open Data of Public Administration [5] and the Methodology for Publishing Open Data of Public Administration of the Czech Republic [6] were created. The most significant milestone for the development of open data in the Czech Republic, however, was the year 2015, when it was possible to: 

• launch full operation of the National Open Data Catalogue (May 2015), see http://data.gov.cz, 

• create standards for preparing, publishing, and cataloguing open data of Czech public administration, see http://opendata.gov.cz, 

• create, and validate with representatives of public administration, model publication plans for the individual types of public authorities, i.e. central bodies, regions, and the individual types of municipalities, 

• create a draft of legislative changes for open data of Czech public administration, 

• prepare and carry out the first wave of training in open data; the training materials created were provided to all public administration offices and to the public through the web portal of the Ministry of the Interior of the Czech Republic. In total, 415 people from 206 entities were trained, of which: 10 ministries, 7 other central state administration bodies, 8 regional offices, 69 municipalities with extended powers, and 112 other municipal offices. 

3 Current developments in open data 

In 2016, work in the area of open data intensified further. The legislative process is being completed, i.e. the amended Act No. 106/1999 Coll. is being followed by work on a government regulation that will mandate the publication of enumerated information as open data. The government regulation is currently (September 2016) in the inter-ministerial comment procedure. 

In 2016 the position of National Open Data Coordinator at the Ministry of the Interior was established and filled. Thanks to this, methodological support for all public administration bodies is being coordinated and unified. The national open data catalogue is also being supported and developed. As of the beginning of September 2016, 38963 datasets are registered in the National Open Data Catalogue (NKOD); the largest number of them is registered by the Czech Office for Surveying, Mapping and Cadastre (ČÚZK). 


Fig. 1. Numbers of datasets and data files registered in NKOD, excluding the ČÚZK datasets. 

In 2016 the following activities are also being carried out and prepared: 

• continuation of training in open data and exchange of experience, 

• development and public consultation of standards (workshops) for publishing and cataloguing open data, especially in relation to the development of international standards and methodologies, 

• consultation of specific data domains and datasets. 

4 Conclusion 

In the coming years, the Ministry of the Interior, in cooperation with the professional public and academic institutions, will focus on a significant innovation of the national open data catalogue, in terms of user friendliness and data quality as well as interoperability with other open data catalogues. It is also intended to create a concept for embedding open and linked data in the National Architecture Plan, including 

• creating a data policy describing a uniform way of embedding open data principles in the context of the National Architecture Plan [4], 

• creating a concept of a semantic vocabulary of terms and initiating its creation and development, in order to unify the terminology and data structures (at the syntactic and especially the semantic level) used in publishing open data, 

• creating the preconditions for linking the open data of individual public administration bodies. 

The Standards for Publishing and Cataloguing Open Data will be developed accordingly. In meeting these goals, emphasis will be placed on ensuring compliance with current international standards and methodologies for open data, especially those issued by the European Commission. A process for monitoring how individual public administration bodies fulfil the concept will also be set up. 
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spravy-ceske-republiky-pro-obdobi-2014-2020 . 

10. Strategie rozvoje ICT sluzeb verejne spravy a jeji opatreni na zefektivneni ICT sluzeb. Available at: https://apps.odok.cz/zvlady/usneseni/-/usn/2015/889 . 
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Annotation: 

Open Data activities in the Czech Republic 

The objective of this paper is to summarize open data activities in the Czech Republic in recent years. It also highlights the milestones in introducing open data into the Czech legal system, as well as the role and use of the open data methodology and standards created for publishing and cataloguing open data in Czech state administration. Furthermore, the paper presents the planned activities of the Ministry of the Interior of the Czech Republic in this agenda. 

Abstract. The ability to detect and accurately recognize various attributes, such as the color or the make of a vehicle, plays a fairly important role in intelligent transportation systems (ITS), as well as in the work of the Police of the Czech Republic (PČR), where this ability is highly valued, especially when detecting vehicles of interest or stolen vehicles. The work focuses on classifying these attributes from images obtained from various cameras in real traffic. Such images often contain various types of distortion, which considerably complicate correct classification. Part of the work is a practical comparison of the usability of popular machine learning methods, including RandomForest, Support vector machine, and the increasingly popular deep neural networks. The experiments showed that although deep neural networks achieve very good results, it is not always necessary or efficient to use this method. 


Type of contribution: Invited lecture 

Keywords: machine learning, neural networks, make classification, vehicle color, SVM, RandomForest, deep neural networks 


1 Introduction 

The ability to detect and accurately recognize various attributes, such as the color or the make of a vehicle, plays a fairly important role in intelligent transportation systems (ITS), as well as in the work of the Police of the Czech Republic (PČR), where this ability is highly valued, especially when detecting vehicles of interest or stolen vehicles. More detailed knowledge about a detected vehicle greatly helps when searching for it retrospectively (for example, for a passage of a vehicle of interest), because this information can refine the search result. The basic attribute used in a search is generally the vehicle's registration plate (RZ). 

If detection of the plate fails (for example because the input conditions of the algorithm used are not met, or because the plate is deliberately obscured or removed), subsequent identification can be simplified by using the other detected attributes of the vehicle. These attributes also enable automatic detection of events that cannot be detected from the registration plate alone, in particular the unauthorized use of one plate on several vehicles. 


2 Lecture content 

In the lecture I will focus on classifying the color and recognizing the type of a vehicle, using various machine learning methods to solve these problems. The dataset consists of images taken by various cameras deployed in real traffic. The quality of images obtained this way varies widely and is often affected by the external environment, which causes various distortions of the captured image. Other factors affecting image quality include the differing orientation of the cameras and the type of camera itself. 

The most common way of classifying vehicle attributes is to progressively segment the image into regions of interest, which are then processed and classified separately. The first step is usually detecting the vehicle, recognizing the registration plate, and then refining the position of the vehicle. Knowledge of the position and rotation angle of the registration plate can be used to implement an algorithm that efficiently selects the region of interest. Information is then extracted from the region of interest and classified using machine learning methods. 
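
How the position and rotation angle of the registration plate might drive the selection of a region of interest can be sketched as follows. This is a hypothetical illustration, not the algorithm used in the lecture: the function name `roi_above_plate` and the size factors (`offset`, `width`, `height`, all in units of plate width) are assumptions chosen only to make the geometry concrete.

```python
import math


def roi_above_plate(cx, cy, plate_w, angle_deg, offset=1.0, width=2.0, height=0.8):
    """Return the four corners (top-left, top-right, bottom-right, bottom-left)
    of a rectangle placed above the plate and rotated with it.

    (cx, cy) is the plate centre in image coordinates (y grows downward),
    plate_w the plate width in pixels, angle_deg the plate rotation.
    """
    a = math.radians(angle_deg)
    ux, uy = math.cos(a), math.sin(a)   # unit vector along the plate
    nx, ny = uy, -ux                    # unit normal pointing "up" in the image
    # Centre of the region: shifted from the plate centre along the normal.
    rx = cx + nx * offset * plate_w
    ry = cy + ny * offset * plate_w
    hw = width * plate_w / 2.0          # half-extent along the plate
    hh = height * plate_w / 2.0         # half-extent along the normal
    return [
        (rx - hw * ux + hh * nx, ry - hw * uy + hh * ny),
        (rx + hw * ux + hh * nx, ry + hw * uy + hh * ny),
        (rx + hw * ux - hh * nx, ry + hw * uy - hh * ny),
        (rx - hw * ux - hh * nx, ry - hw * uy - hh * ny),
    ]


corners = roi_above_plate(cx=100, cy=200, plate_w=50, angle_deg=0)
print(corners)
```

Pixels inside such a region (for instance a patch of the bonnet for color, or the grille area for the make) would then be passed to feature extraction and to the classifier.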

The lecture will present a practical comparison of the usability of popular machine learning methods, including RandomForest, Support vector machine (SVM), and the increasingly popular deep neural networks. 

These selected machine learning methods are applied to images obtained from more than 50 cameras. 

Finally, the advantages and disadvantages of the individual methods will be summarized, including experience gained from real operation. 
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Annotation: 

Automatic recognition of vehicle attributes using machine learning 

Detection and recognition of vehicle attributes is an important part of ITS, particularly when searching for vehicles of interest or stolen vehicles. The lecture discusses the applicability of machine learning methods, including RandomForest, Support vector machine, and deep neural networks, to identifying individual vehicle attributes from camera images taken in a real environment. Results from a functional implementation of machine learning algorithms for classifying vehicle color and make, deployed in a real-world environment for the Police of the Czech Republic, will be presented. 

Abstract. Large international companies such as Google, Microsoft, and Facebook invest considerable resources in supporting research and development in the area of deep neural networks. The recent defeat of a champion in the game of Go, achieved thanks to deep learning, shows the potential of this approach. Machine learning applications based on deep artificial neural networks achieve better results in many areas than approaches based on hand-tuned features. In this invited lecture we go through the basic principles of deep learning and show applications of this approach to various problems we are working on at ÚISI FIIT STU. 

Type of contribution: Invited lecture 

Keywords: neural networks, machine learning 


1 Introduction 

The popularity of artificial neural networks in machine learning has grown significantly in recent times. This is mainly thanks to new, successfully applied learning approaches and neural network architectures, as well as the availability of massively parallel computing resources for training networks (graphics card chips). Another contributing factor is that large international companies such as Google, Microsoft, and Facebook invest considerable resources in supporting research and development in this area. This contribution gives a brief introduction to deep networks and describes applications of this approach to various problems we are working on at ÚISI FIIT STU. 

2 From the perceptron to a deep neural network 

Classification of objects, such as pattern recognition, medical diagnosis, etc., is a very frequent problem addressed in machine learning. In these tasks we have a set of labeled objects (feature vectors, which may be the pixels of an image or the results of a patient's tests). Each object also has an assigned category, for example dog or cat, or flu or tonsillitis. 

The basic building block of a neural network is the neuron, and one of the first learning approaches was the use of the perceptron. In its simplest form the perceptron is a binary classifier that maps inputs x to the output values 0 or 1 according to: 

$$h(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise} \end{cases}$$

where w is the vector of weights and b is the threshold value. The perceptron is trained by changing the weights: for each input vector from the training set the output of the perceptron is computed, and the weights are adjusted so as to minimize the classification error. 
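
The training procedure just described can be sketched in a few lines. This is a minimal illustration using the classical perceptron learning rule (weights are moved by the error times the input); the learning rate, the number of epochs, and the AND training set are assumptions chosen for the example, not values from the lecture.

```python
def predict(weights, bias, x):
    """Perceptron output h(x): 1 if w . x + b > 0, otherwise 0."""
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """samples: list of (input_vector, label) pairs with labels 0 or 1."""
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, label in samples:
            error = label - predict(weights, bias, x)  # -1, 0 or +1
            # Move the weights toward the correct answer only on a mistake.
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


# Logical AND is linearly separable, so the rule converges on it.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND)
print([predict(w, b, x) for x, _ in AND])
```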

A disadvantage of a single perceptron is that it can learn only linearly separable problems. For example, it cannot approximate the XOR function. The solution to this problem is to build a multilayer feed-forward network, which is essentially just several perceptrons connected into layers. The simplest multilayer network is a three-layer feed-forward network: the first layer represents the input of the network, the second layer, also called the hidden layer, contains perceptrons that process the activations from the inputs, and the output (third) layer processes information from the hidden layer. The connection between layers is typically full, i.e. a perceptron in one layer processes the activations of all neurons in the previous layer. Each connection between neurons is defined by a weight w. Input data are processed layer by layer. We present the input data to the input of the network; they propagate to the neurons of the hidden layer, which process them by applying an activation function to the weighted sum of the inputs and the corresponding weights. It is important that the activation function is not linear, so, for example, the sigmoid function or tanh is used. Further layers work analogously, always processing the outputs of the previous layer. The output of the neural network is the output of the last layer. 
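
That a hidden layer is enough to represent XOR can be checked with a small sketch. The weights below are set by hand for illustration (they are an assumption, not a trained result), and a step activation is used instead of a sigmoid so that the outputs are exactly 0 or 1.

```python
def step(z):
    """Non-linear activation: 1 for a positive weighted sum, else 0."""
    return 1 if z > 0 else 0


def neuron(weights, bias, inputs):
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def xor_network(x1, x2):
    # Hidden layer: one neuron computes OR, the other computes AND.
    h_or = neuron([1, 1], -0.5, [x1, x2])
    h_and = neuron([1, 1], -1.5, [x1, x2])
    # Output layer: fires when OR is true but AND is not.
    return neuron([1, -1], -0.5, [h_or, h_and])


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```

Each hidden neuron draws one straight line in the input plane; the output neuron combines the two half-planes, which a single perceptron cannot do.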

Backpropagation of error [6] can be used to train a multilayer neural network. It is based on computing the error at the output of the network and propagating this error backward through the network while adjusting the weights. 

A neural network with one hidden layer is a universal function approximator [4]. The problem, however, is that for real problems a very large hidden layer is needed to achieve the required accuracy. A large hidden layer in turn increases the number of network weights for which training has to find optimal values, which increases the computational cost as well as the required size of the training set. Large neural networks run the risk of overfitting, where the network learns the training patterns as if by heart and loses the ability to generalize. It turns out that deep neural networks with a larger number of hidden layers can find a solution more efficiently, with a smaller number of weights. 

Increasing the number of hidden layers brings two problems. The first is the possibility of overfitting and the second is the "vanishing gradient": when backpropagation is used, the value of the gradient shrinks as it passes through the layers, so that the change of the weights in the lower layers is minimal. 
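
The shrinking of the gradient can be made concrete with a small numerical sketch. It assumes a deliberately simplified network, a chain with one sigmoid neuron per layer and all weights equal to 1; the derivative of the sigmoid is at most 0.25, so the factor by which an error signal is scaled on its way back decays at least geometrically with depth.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_factor(n_layers, x=0.0, weight=1.0):
    """Derivative of the output with respect to the input for a chain of
    n_layers sigmoid neurons (one neuron per layer, identical weights)."""
    a, factor = x, 1.0
    for _ in range(n_layers):
        a = sigmoid(weight * a)
        # Chain rule: sigma'(z) * w, with sigma'(z) = a * (1 - a) <= 0.25.
        factor *= weight * a * (1.0 - a)
    return factor


for depth in (1, 2, 5, 10):
    print(depth, gradient_factor(depth))
```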

A good way to tackle these problems is to pre-train the layers of the network using autoencoders and RBM (Restricted Boltzmann Machine) networks [2]. When a neural network is trained, an internal representation of the input arises in the hidden layer. This fact can be used in training autoencoders, i.e. networks that have the same number of neurons in the input and the output layer. As a rule, the hidden layer contains a smaller number of neurons. During training, patterns from the training set are presented at the input and the same pattern is required at the output. We thus force the network to create in its hidden layer a representation of the input which, given the smaller number of neurons, has a reduced dimension, so the network has to select only the essential features. In this way deeper networks can be trained step by step, by adding further autoencoders and stacking them in layers one after another. Since autoencoders create only an internal representation of the data, the last layer in networks of the stacked autoencoder type is a normal output layer of a feed-forward network, and the whole deep network is fine-tuned by backpropagation. 
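
A single autoencoder of the kind described above can be sketched in plain Python. The sketch makes several assumptions that are not from the lecture: a 4-2-4 layout, sigmoid units, squared reconstruction error, online gradient descent, and four one-hot training patterns. It covers only one autoencoder, not the stacking or the final fine-tuning.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_autoencoder(patterns, n_hidden=2, epochs=500, lr=0.5, seed=0):
    """Train input -> hidden -> output to reproduce its own input.
    Returns the summed squared reconstruction error after each epoch."""
    rng = random.Random(seed)
    n = len(patterns[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n)]
    b2 = [0.0] * n
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x in patterns:
            # Forward pass: encode to the small hidden layer, then decode.
            h = [sigmoid(sum(w * xi for w, xi in zip(W1[j], x)) + b1[j])
                 for j in range(n_hidden)]
            y = [sigmoid(sum(w * hj for w, hj in zip(W2[k], h)) + b2[k])
                 for k in range(n)]
            total += sum((yk - xk) ** 2 for yk, xk in zip(y, x))
            # Backward pass: the target is the input pattern itself.
            d_out = [(y[k] - x[k]) * y[k] * (1 - y[k]) for k in range(n)]
            d_hid = [sum(d_out[k] * W2[k][j] for k in range(n)) * h[j] * (1 - h[j])
                     for j in range(n_hidden)]
            for k in range(n):
                for j in range(n_hidden):
                    W2[k][j] -= lr * d_out[k] * h[j]
                b2[k] -= lr * d_out[k]
            for j in range(n_hidden):
                for i in range(n):
                    W1[j][i] -= lr * d_hid[j] * x[i]
                b1[j] -= lr * d_hid[j]
        losses.append(total)
    return losses


one_hot = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
losses = train_autoencoder(one_hot)
print(losses[0], losses[-1])
```

In a stacked setting, the hidden activations of one trained autoencoder would serve as the training input of the next one.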

Boltzmann machines are neural networks capable of unsupervised learning, i.e. they can find patterns in a body of input data without needing to know what the output should be. In effect they create a better representation of the data at a higher level of abstraction. The problem with Boltzmann machines, however, is that although their theory is very well developed, they are not very usable in practice because their learning is too slow. There are, however, several ways of modifying the architecture of the Boltzmann machine and its learning process that make learning considerably more efficient, and thanks to them such machines are very usable in practice. One of these modifications is the Restricted Boltzmann Machine (RBM). The restriction consists in the RBM having no connections between two hidden neurons or between two visible neurons. Its architecture is therefore a bipartite graph in which the hidden and the visible layer of neurons are fully connected to each other, but there are no connections within a layer. As with autoencoders, RBMs can be stacked on top of each other, which gives rise to a so-called Deep Belief Network. 

3 Examples of using deep networks 

One of the areas in which we investigated the use of neural networks is machine learning over sequential data. One scenario we focused on is finding a better representation of a sequence of data from an eye-tracking device (eye-tracker). We used a Restricted Boltzmann Machine (RBM) network, to which we progressively presented a visual representation of a user's session in front of the eye-tracker in the form of heat maps that captured both spatial and temporal information about the user's gaze [1]. The RBM network was able to find features that provided a suitable, more abstract representation of a single user session, which we successfully used for further machine learning tasks (segmentation of activity during a session). 

We also investigated recurrent neural networks: the use of an LSTM network [3] for predicting the conversion of readers of articles on a website with an integrated payment system, based on standard logs of user access to the content. We examined the effect of architectures, dropout, data shuffling, input data, and activation of the reset gate on the performance of our model, which we compared with a random forest built on human-selected features. It turned out that our architecture involving an LSTM network predicts better than the random forest (F-measure 17 % vs. 46 %); it does not reach the same precision as the random forest, but it beats it in recall. 
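
Why a model with lower precision can still win on the F-measure follows from its definition as the harmonic mean of precision and recall. The sketch below uses the standard formula; the precision and recall values in the example are made up for illustration and are not the results reported above.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta = 1 gives the usual harmonic mean (F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical values: model A is more precise, model B has higher recall.
model_a = f_measure(precision=0.6, recall=0.1)
model_b = f_measure(precision=0.4, recall=0.6)
print(model_a, model_b)
```

Because the harmonic mean is dominated by the smaller of the two values, a very low recall drags the score down even when precision is high.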
As part of our research on extracting discriminative keywords [8], we also explored the possibility of creating a neural network architecture ideally suited to this task. Most neural network architectures for keyword extraction have so far been designed for supervised training [9]. In such supervised learning we have a set of text documents whose keywords are known. We, however, focused on more advanced architecture design techniques that would not require documents with known keywords. Instead, we focus on the task of document categorization, in which we are interested in discriminative keywords, i.e. those that distinguish well between the given categories. Inspired by the Inception module [7], we focused on designing a universal module for extracting discriminative keywords. The basic idea of the architecture of our neural network is to model keywords in an intermediate layer rather than in the output layer. The function of the output layers (two, in our case) is to provide feedback for the intermediate keyword layer. One output layer ensures that the features of the keyword layer represent features of actual words that occur in the given document. The other output layer ensures that the features of the keyword layer are discriminative, which also has a positive side effect, since on this output layer we see the probabilities of the document belonging to the individual categories. We successfully combined the proposed architecture with several standard neural network architectures, such as convolutional [5] and LSTM networks. 

4 Conclusion

In this paper we briefly introduced some models of deep neural networks and their use in
various machine learning tasks. The current trend is to combine different types of
networks into layers of truly deep and complex neural network architectures. There are
many frameworks, e.g. TensorFlow¹, Torch² or Theano³, that make it easy to build such
layered networks and to train them on the chips of graphics cards. We believe that by
combining and alternating different types of networks in the layers of deep networks in
this way, it will be possible to achieve even better results in tasks that imitate human
intelligence.

Acknowledgment: This publication was created thanks to partial support from the
projects: Adaptation of access to information and knowledge artifacts based on
interaction and collaboration within the web environment (VG 1/0646/15), Intelligent
analysis of big data corpora by semantic-oriented and bio-inspired methods in a parallel
environment (VG 1/0752/14), Human information behaviour in the digital space
(APVV-15-0508), and the project within the Research and Development Operational
Programme: International Centre of Excellence for Research of Intelligent and Secure
Information-Communication Technologies and Systems, ITMS 26240120039, co-funded by the
European Regional Development Fund.


¹ TensorFlow - Open Source library for Machine Intelligence, https://www.tensorflow.org/

² Torch - scientific computing on GPUs, http://torch.ch/

³ Theano - Python library for deep learning, http://deeplearning.net/software/theano/

Abstract. Both Python and R are popular programming languages for data analysis. While
R's functionality is developed with statisticians in mind, Python is often praised for
its easy-to-understand syntax. In this paper, we highlight some of the differences
between R and Python and show how both have a place in the data science world.

Paper type: Invited talk 

1 Introduction

R or Python: which to choose? Which is better? These questions have troubled researchers
for a long time without a clear or obvious answer. As data science has many faces, there
are fields where R is more suitable and other use cases where Python is the language to
choose. The aim of this paper is not to produce yet another comparison that ends in a
diplomatic tie. We focus instead on describing practical experience and giving advice on
how to choose a language for a particular researcher and research task.

2 Language characteristics 

If you look at the communities of both languages and their activity on GitHub and
discussion forums, you find that both are large and active. R is popular mainly in the
academic environment and is slowly moving into the enterprise sphere. The Python
community, on the other hand, is formed mostly by engineers, programmers and hackers who
moved into the data analysis field. Nowadays R appears to be a step ahead in fields like
time series analysis, econometrics, robust statistical methods, Bayesian statistics and
machine learning, but in our opinion this is mostly for historical reasons. A few years
back (and to some extent also today), R had more packages for statistical analysis, data
processing and machine learning than Python.

However, by actively borrowing the good parts from various mathematical languages,
Python's library offering is growing rapidly. The Pandas library brought R-style data
frame manipulation to Python, NumPy brought the numerical analysis and matrix operations
best known from languages such as Matlab or Octave, scikit-learn brought many machine
learning algorithms, and there are many more. These libraries are actively developed and
are growing rapidly in an organized way, as they can build on years of experience from
other, more mature languages.

An argument in favor of R is that even though Python is easy to learn, you need to get
your hands dirty and write more glue code between the various parts of these libraries.
In R, you follow a workflow optimized for the typical steps, and the analysis is
straightforward, generally just a few lines of code. Hence, the R language is very easy
to use. However, this can come back to bite you if you have to do something atypical or
create something that has never been done before. Python, on the other hand, offers a
full-fledged alternative for data science and brings the advantages of a programming
language with uniform syntax, high code readability and testability.

3 Which language fits my research the best 

The choice between R and Python should depend mainly on the task the language will be
used for. The closer the task is to pure mathematics and research, the more suitable the
R language is, with its data frames, matrix data processing, machine learning,
statistical testing and visualization packages. On the other hand, for more real-world
problems that require working with messy data from multiple sources and applying
research results in production, Python is more suitable.

When choosing a suitable language for your research, you should consider what your aim
is. If you are using large amounts of data, need to evaluate various algorithms easily,
and the result of your research will be an analysis or a research paper, R is a logical
choice. You can achieve this result without extensive coding, testing or building a
production application. You simply use existing libraries for data manipulation,
analysis and visualization and build the script in one go.

On the other hand, if you first need to collect the data, crawl various web sources with
different structures, process the data and then create a production service from it, you
should rather choose Python. It is also more suitable for projects where a larger
codebase will be built, maintained and read multiple times, potentially by different
people. The main aim of the authors of the Python language was code beauty and
readability, and as the codebase grows, it tends to be much more maintainable when
written in Python than in R. The R language is more suitable for individual analyses or
algorithms and less so for complex systems.

4 Typical data science workflow 

The typical workflow in the R language closely follows the typical progress of any data
analysis. You load the data, preprocess it, visualize it, train the model, visualize the
model and its results, and statistically evaluate its accuracy. Of course, this is a
very limited characterization, and one can find other modes of applying R that spiral
out of this typical one.

In the case of Python, you cannot define such a stable workflow, as it was designed as a
general-purpose language and its uses are really broad. It can be used in the data
collection phase, to transform data in novel ways, and in the typical data analysis
workflow (even though the typical roads are not so well-worn and you need more glue code
to join the various libraries), but it can also be used to build complex systems and
production applications that employ data analysis results. In general, Python is very
flexible, and you, as a data scientist, can benefit from this when doing something novel
that has never been done before.

5 Conclusions 

The more tools you have at your disposal, the more effectively you can do the job at
hand. So the best choice is to learn both and choose the most suitable tool for the
problem you are currently dealing with.

Another piece of good advice is to master at least one language. It can be anything: R,
Python or even Java. Only if you know a language well are you able to use it effectively
and benefit from its features and strengths.

Both languages are getting better at doing what the other does well, though. We already
noted that Python has Pandas to mimic R functionality. R has a web application framework
called Shiny. There are libraries to use R with Python, and vice versa. We would simply
recommend using both. When you need something general-purpose, Python is better. When
you just need to do data analysis or answer a question, R is better.

Acknowledgment: This work was partially supported by the Research and Development
Operational Programme for the project International Centre of Excellence for Research
of Intelligent and Secure Information-Communication Technologies and Systems,
ITMS 26240120039, the Scientific Grant Agency of the Slovak Republic, grants No.
VG 1/0752/14 and VG 1/0646/15, the Slovak Research and Development Agency under
contract No. APVV-15-0508, and the Cultural and Educational Grant Agency of the Slovak
Republic, grant No. KEGA 009STU-4/2014.




Data Analysis,
Data Mining
and Machine Learning




Course Similarity Analysis 


Hana Bydzovska 

CSU and KD Lab Faculty of Informatics 
Masaryk University, Brno 

bydzovska@fi.muni.cz


Abstract. Courses offered to students at universities have different characteristics. In
this paper, we analyse course similarities to improve student performance prediction. We
utilize the item-to-item collaborative filtering approach, which computes course
similarities based on students' grades. We also use content-based techniques to compute
course similarities based on information from the course catalogue, e.g. the course
content or prerequisites. Using the computed similarities and different clustering
algorithms, we are able to reveal interesting course groups that can be used to improve
student performance prediction. Finally, we are able to predict students' final grades
in the investigated course by examining grades of only three related courses.

1 Introduction

The problem of predicting a student's grade in a particular course has recently been
addressed using data mining techniques. Researchers usually examine study-related
records, e.g. the age, the gender, and the field of study [5], because they are easily
available in university information systems. Moreover, they attempt to identify
additional characteristics that can lead to a better understanding of students'
behaviour, e.g. their habits [3] or parents' education [6]. The most typical way to
obtain such data is to conduct questionnaires. We cannot rely on data obtained by
questionnaires, since they tend to have a low response rate. Therefore, only data
originating from the Information System of Masaryk University are employed in our
experiments. Our approach is based on recommender system techniques [2] applied to the
educational context. We mapped the user-item-rating problem to the student-course-grade
problem and predicted the final grades based on previous achievements of similar
students. We also succeeded in identifying course dependencies. Finally, we were able to
predict the final grades of the investigated course by examining grades of only 3 other
courses.

2 Course Similarity

2.1 Students' Grades

The item-to-item collaborative filtering approach from recommender system theory was
utilized. We mapped the user-item-rating problem to the student-course-grade problem
[1]. The first step was to construct a matrix G in which rows represented students and
columns represented courses. The cells of the matrix held the grades. If a student did
not attend a particular course, the corresponding cell remained empty. The adjusted
cosine measure was then calculated from matrix G for each pair of courses.
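
The adjusted cosine measure can be sketched as follows. The sketch assumes numeric grades and represents the sparse matrix G as a dictionary of dictionaries; the student and course identifiers and the grade values are made up for illustration. Each grade is centred by the mean grade of the student who received it, and only students who attended both courses contribute.

```python
import math


def adjusted_cosine(grades, course_a, course_b):
    """Adjusted cosine similarity of two courses.

    grades maps student -> {course: numeric grade}; a missing key plays the
    role of an empty cell of matrix G. Returns None when the similarity is
    undefined (no common students or zero variance).
    """
    numerator = norm_a = norm_b = 0.0
    for student_grades in grades.values():
        if course_a in student_grades and course_b in student_grades:
            # Centre by the student's own mean grade over all attended courses.
            mean = sum(student_grades.values()) / len(student_grades)
            diff_a = student_grades[course_a] - mean
            diff_b = student_grades[course_b] - mean
            numerator += diff_a * diff_b
            norm_a += diff_a * diff_a
            norm_b += diff_b * diff_b
    if norm_a == 0 or norm_b == 0:
        return None
    return numerator / math.sqrt(norm_a * norm_b)


# Hypothetical grade matrix G (rows: students, columns: courses A, B, C).
G = {
    "s1": {"A": 1, "B": 1, "C": 4},
    "s2": {"A": 4, "B": 4, "C": 1},
    "s3": {"A": 2, "C": 2},          # s3 did not attend course B
}

sim_ab = adjusted_cosine(G, "A", "B")   # students grade A and B alike
sim_ac = adjusted_cosine(G, "A", "C")   # grades in A and C move in opposite directions
```

Centring by the student mean compensates for students who systematically receive better or worse grades, which is the usual motivation for preferring the adjusted variant over the plain cosine.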

2.2 Course Characteristics 

Students search the Course Catalogue for useful information about courses that helps
them decide whether or not to enrol in a course. We selected different course
characteristics and attempted to identify dependencies among courses [1]. The similarity
of courses was defined as the weighted sum of the similarities of the selected course
characteristics: prerequisites, literature, course content represented by the text about
the study subject and outline, teachers, and course supervisor.

3 Course Clustering 

Subsequently, we constructed a similarity matrix for each of the two approaches, in
which both rows and columns represented courses. Each cell held the similarity of the
corresponding pair of courses. For both matrices, we utilized three different clustering
algorithms to create course clusters: k-means, spectral clustering and average-link
clustering [4]. The resulting clusters defined the groups of similar courses.

For each clustering setting, we also computed the Davies-Bouldin index and the Dunn
index in order to assign the best score to the algorithms and settings that produce
clusters with high similarity within a cluster and low similarity between clusters.
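
As an illustration of such a validity score, the Dunn index can be sketched as the smallest distance between items of different clusters divided by the largest distance within a cluster, so that higher values indicate better separated clusters. The sketch assumes that a similarity s is converted to a distance as 1 - s; this conversion and the toy similarities are assumptions made for the example.

```python
from itertools import combinations


def dunn_index(distance, clusters):
    """Dunn index of a clustering.

    distance is a symmetric dict of dicts of pairwise distances,
    clusters is a list of lists of item identifiers.
    """
    # Smallest distance between two items that lie in different clusters.
    min_between = min(
        distance[a][b]
        for first, second in combinations(clusters, 2)
        for a in first
        for b in second
    )
    # Largest distance between two items of the same cluster (diameter).
    max_within = max(
        (distance[a][b] for cluster in clusters for a, b in combinations(cluster, 2)),
        default=0.0,
    )
    return min_between / max_within if max_within else float("inf")


# Hypothetical course similarities; distance = 1 - similarity (assumption).
similarity = {
    ("A", "B"): 0.9, ("C", "D"): 0.8,
    ("A", "C"): 0.2, ("A", "D"): 0.1, ("B", "C"): 0.1, ("B", "D"): 0.2,
}
distance = {}
for (a, b), s in similarity.items():
    distance.setdefault(a, {})[b] = 1 - s
    distance.setdefault(b, {})[a] = 1 - s

good = dunn_index(distance, [["A", "B"], ["C", "D"]])  # similar courses together
bad = dunn_index(distance, [["A", "C"], ["B", "D"]])   # similar courses split
```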

To be able to analyse the created course groups, we designed an application (see Figure
1) that allows us to visualize them. The application can also help university management
to revise course characteristics, their difficulty, or their location in the course
templates that define students' study plans.

4 Student Performance Prediction 

For grade prediction, we utilized the user-to-user collaborative filtering approach. We
used matrix G defined in Section 2.1, and students' similarity was calculated by
Pearson's correlation coefficient. Finally, when we predicted students' grades in a
certain course, we reduced the computations to the grades obtained in courses belonging
to the same cluster as the investigated course.
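
A minimal sketch of this prediction step is given below. Student similarity is the Pearson correlation over the courses of the cluster that both students attended. How the grades of similar students are aggregated is not specified above, so the similarity-weighted average over positively correlated students is an assumption, as are all names and grade values.

```python
import math


def pearson(a, b):
    """Pearson correlation of two students over the courses both attended."""
    common = [course for course in a if course in b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[c] for c in common) / len(common)
    mean_b = sum(b[c] for c in common) / len(common)
    numerator = sum((a[c] - mean_a) * (b[c] - mean_b) for c in common)
    denominator = math.sqrt(
        sum((a[c] - mean_a) ** 2 for c in common)
        * sum((b[c] - mean_b) ** 2 for c in common)
    )
    return numerator / denominator if denominator else 0.0


def predict_grade(grades, student, course, cluster):
    """Predict a grade from similar students, using only courses of one cluster."""

    def restrict(student_grades):
        # Keep only grades from the cluster, without the predicted course itself.
        return {c: g for c, g in student_grades.items() if c in cluster and c != course}

    target = restrict(grades[student])
    weighted_sum = weight_total = 0.0
    for other, other_grades in grades.items():
        if other == student or course not in other_grades:
            continue
        similarity = pearson(target, restrict(other_grades))
        if similarity > 0:  # assumption: ignore dissimilar students
            weighted_sum += similarity * other_grades[course]
            weight_total += similarity
    return weighted_sum / weight_total if weight_total else None


# Hypothetical data: course X is predicted from the cluster {X, A, B, C}.
G = {
    "t": {"A": 1, "B": 2, "C": 3},
    "u": {"A": 1, "B": 2, "C": 3, "X": 2},
    "w": {"A": 3, "B": 2, "C": 1, "X": 4},   # negatively correlated with t
    "z": {"A": 2, "B": 3, "C": 4, "X": 3},
}
predicted = predict_grade(G, "t", "X", cluster={"X", "A", "B", "C"})
```

Restricting the comparison to one cluster keeps the number of grades that have to be examined per prediction small, which is the effect on the number of calculations reported in the evaluation.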

Fig. 2. Application for Faculty Management.

In this section, we focused on clusters obtained by the hierarchical clustering
algorithm [4]. In comparison with the method using all grades, both approaches
(similarity using grades: SC1, similarity using course characteristics: SC2) had
positive effects on the number of calculations (see Table 1). Of all 138 courses, 123
belonged to some of the created clusters, and their final grades could be predicted
based on the grades of only 3 other courses on average. 70 of our investigated courses
belonged to different clusters under SC1 and SC2. A slightly better MAE was obtained for
these courses by the method utilizing the course characteristics. Therefore, when a
grade is predicted, the corresponding course is searched for in SC2 first, then in SC1.


Table 1. Comparison of SC1 and SC2

Method       MAE    Sensitivity  Number of clusters  Average cluster size  Shared courses
All grades   0.687  0.402        1                   499                   10
SC1          0.681  0.390        37                  3                     3
SC2          0.640  0.386        36                  3                     2

5 Conclusion

In this paper, we focused on the problem of predicting students' final grades. Our
approach utilized recommender system techniques and predicted grades based on the
similarity of students' achievements. Every university information system stores the
data about students' grades that were needed for the prediction. We also succeeded in
identifying course dependencies. Finally, we were able to predict the final grades of
the investigated course by examining grades of only 3 other courses.

Once we have a reliable performance prediction, it can be used in many contexts: for
identifying weak students, for guiding the adaptive behaviour of intelligent tutoring
systems, or for providing feedback to students. We are interested in designing a course
enrolment recommender system that will help students select courses to enrol in.

Abstract. In this paper we present an experiment carried out within a project devoted to
the security of mobile devices. In the experiment we attempted to apply the method of
language modelling with n-grams to the domain of mobile device security. The goal was to
find out whether n-gram models built from the event logs of mobile devices can be used
to detect dangerous events and chains of events.

Paper type: Research paper

Keywords: security, n-grams, language modelling


1 Introduction

The subject of our experiments was to explore the use of probabilistic language
modelling methods in the domain of mobile device security. In the experiments, instead
of sequences of words of a language, we modelled sequences of events captured in the
logs of mobile devices (phones and tablets running Android). As in natural language, we
assumed that in event logs the next event depends to some degree on the preceding
events. In natural language this property represents semantics, where a certain sequence
of words carries some meaning. In event logs we assumed that such a sequence is part of
some process consisting of certain actions, which are recorded as a series of events.
Our aim was to build models of dangerous event chains from the event logs and to use
these models to detect suspicious activity on mobile devices. In the experiment we
applied the models to samples of events both with and without suspicious activity. We
evaluated the results using the metrics of perplexity and log-probability (the
probability with which the tested sample resembles the samples from the training set of
the model).

2 State of the Art

To our knowledge, there is no literature dealing with the use of probabilistic language
modelling methods for detecting suspicious chains in the event logs of mobile devices.
Similar approaches can, however, be found in [1], where the authors tested the accuracy
of detecting suspicious pieces of code using n-gram models. Preliminary results showed
98% detection accuracy under 3-fold cross-validation on a dataset consisting of 65
suspicious programs obtained from e-mail communication. In another related work [3], the
authors dealt with the use of n-gram models for recognizing unknown malware in
connection with signature-based detection. Their results showed that n-gram models can
detect even unknown code samples.

3 Natural Language Modelling

Probabilistic language modelling is known from the field of natural language processing.
It is often used for tasks such as: a) language detection, b) automatic correction of
errors in text, c) prediction while typing, d) machine translation and e) recognition of
written text. The principle is to build a probabilistic model that represents the
probability distribution over all possible strings of words of a given language. Based
on this distribution, it is possible to determine the degree to which an input text
belongs to the modelled language, taking into account the dependencies between
consecutive words in the strings. Language is traditionally modelled using n-grams, i.e.
n-tuples of words that can be formed by the rules of the modelled language. The n-grams
used to build the model are obtained from a training text (training set). By a language
model we mean a probability distribution over the strings of the training set; the model
expresses the probability with which an input string is a sentence of the modelled
language. For example, the probability of a string r of length d consisting of the words
r_1 r_2 ... r_d = r_1^d = r can be expressed by relation (1).

P(r) = \prod_{i=1}^{d} P(r_i \mid r_1 \dots r_{i-1})    (1)

It is more convenient to approximate the probability by assuming that the probability of
the next word depends only on the word or string of words immediately preceding it. The
models are ranked by order accordingly. For example, a bigram model approximates the
probability of the next word based on the previous word; see relation (2). A trigram
model does so based on the two preceding words; see relation (3). The probability for
models of higher order is approximated analogously.

P(r) \approx \prod_{i=1}^{d} P(r_i \mid r_{i-1})    (2)

P(r) \approx \prod_{i=1}^{d} P(r_i \mid r_{i-2} r_{i-1})    (3)


An important step in building an n-gram model is estimating the probabilities of the
known n-grams of the training dataset. The simplest way is to estimate them from the
frequencies of the n-grams in the dataset (the maximum-likelihood estimate). A problem
of the n-gram approach, however, is that n-gram models work best when the test set is
similar to the training set. In practice this is sometimes a problem. Since the training
dataset cannot cover all possible n-grams, unknown n-grams receive zero probability
under such an estimate, which is the sparse data problem. It can therefore happen that
when an unknown n-gram appears in the test dataset, the model assigns it zero
probability and concludes that this n-gram does not belong to the modelled language.
This problem is solved by smoothing the probability estimate so that unknown n-grams are
assigned some small probability. Several smoothing methods are known [2]:

a) add-one smoothing, b) linear interpolation, c) Good-Turing smoothing, d)
Jelinek-Mercer smoothing, e) Katz (backoff), f) Witten-Bell smoothing, g) absolute
discounting and h) Kneser-Ney smoothing. For our experiments we chose the Kneser-Ney
algorithm, which has proved to work best in the domain of natural language modelling. We
used the original version with interpolation.
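
The effect of smoothing can be shown on a toy bigram model. The sketch below deliberately uses add-one smoothing, the simplest method from the list, and not the Kneser-Ney algorithm used in the experiments; the sentence markers, the toy corpus and the function names are assumptions made for illustration.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count bigrams and their contexts in tokenised training sentences."""
    contexts, bigrams, vocabulary = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
        vocabulary.update(tokens[1:])            # everything that can be predicted
    return contexts, bigrams, vocabulary


def mle(prev, word, contexts, bigrams):
    """Maximum-likelihood estimate of P(word | prev): relative frequency."""
    return bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0


def add_one(prev, word, contexts, bigrams, vocabulary_size):
    """Add-one (Laplace) smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + vocabulary_size)


# Toy training set: two "sentences" over the words a, b, c.
contexts, bigrams, vocabulary = train_bigrams([["a", "b"], ["a", "c"]])

seen = mle("a", "b", contexts, bigrams)              # bigram seen in training
unseen = mle("b", "c", contexts, bigrams)            # never seen: zero probability
smoothed = add_one("b", "c", contexts, bigrams, len(vocabulary))
```

With the maximum-likelihood estimate a single unseen bigram makes the probability of the whole string zero; after smoothing, the unseen bigram keeps a small non-zero probability and the estimates for a given context still sum to one over the vocabulary.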

4 Data 

The data, in the form of event logs, came from 18 mobile devices used by test subjects
and were collected over a period of three months. The collected records, in the form of
attribute vectors, were processed on distributed storage formed by a Hadoop cluster with
the Apache Pig and Apache Hive tools. The devices provided data on a) calls made
(CALLS), b) SMS communication (SMS), c) system calls (INTENT RECEIVED), d) information
about running processes (PROCESSES), e) network communication (CONNECTIONS) and f) web
browser history (BROWSER HISTORY).

4.1 Analysis

From the collected records we randomly selected approximately 9 million, which we
subjected to frequency analysis. The characteristics of the selected data sample are
given in Tab. 1.


Tab. 1. Frequency analysis of the selected data sample.

event type        devices  records    unique records  unique records (%)
SMS               10       3 231      462             14.30%
CALLS             11       3 187      651             20.43%
INTENT RECEIVED   18       729 332    572 353         78.48%
PROCESSES         18       4 372 613  3 992 536       91.31%
CONNECTIONS       18       1 951 405  1 797 525       92.11%
BROWSER HISTORY   15       1 919 961  1 098 979       57.24%
total             18       8 979 729  7 462 506       83.10%


The frequency analysis consisted of building statistics on the values of the individual
attributes. The goal was to identify the attributes that specify a certain type of
event, so that we could use them to transform events into more general words. For
example, as many as 92% of the events of type CONNECTIONS were unique, i.e. they almost
never repeated. The reason was the wide range of attribute values as well as the number
of attributes themselves. We therefore needed to remove some attributes or categorize
their values. From these events, which contained the attributes time, gps, acc,
from_addr, from_port, to_addr, to_port, state, uid, application, protocol and imei, we
kept only the attributes application, to_port, protocol and state. We then transformed
them into words representing events in the form
connection://{application}/{to_port}/{protocol}/{state}. The attributes were selected on
the basis of recommendations from an expert on the collected data.
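
The transformation of a connection event into a word can be sketched as follows. The attribute names and the template come from the text above; the attribute values and the function name are hypothetical.

```python
def connection_word(event):
    """Build the word for a connection event from the four kept attributes.

    All other attributes of the event (time, gps, uid, ...) are ignored.
    """
    template = "connection://{application}/{to_port}/{protocol}/{state}"
    return template.format(**event)


# Hypothetical raw event; the values are made up for illustration.
event = {
    "time": "2016-05-01 12:00:00.000",
    "from_addr": "10.0.0.5",
    "from_port": 51234,
    "to_addr": "93.184.216.34",
    "to_port": 443,
    "state": "ESTABLISHED",
    "uid": 10057,
    "application": "com.example.app",
    "protocol": "tcp",
}
word = connection_word(event)
```

Dropping highly variable attributes such as addresses and source ports is what makes identical words repeat, so that n-gram counts become meaningful.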

4.2 Preprocessing

As mentioned in the previous example, we filtered the events and transformed them from
vectors into words (event), each word also having its own timestamp (time). The
transformation followed this scheme:

time                       event

YYYY-MM-dd HH:mm:ss.SSS    type://attr_1/attr_2/.../attr_n

where type was the event type (e.g. BROWSER HISTORY for an event from the web browser)
and attr_1, attr_2 to attr_n were the values of the selected attributes (e.g. the
protocol). We then joined the events transformed into words into sequences, which were
the equivalent of sentences in natural language, while the events in the sequences
corresponded to words in sentences. Events were joined so that the time difference
between two consecutive events in a sequence was no greater than 10 seconds. We chose
this value after a prior discussion with an expert on mobile devices. Joining events
into sequences later allowed us to generate n-grams and, from them, a probabilistic
model of event chains, much as would be done with text.
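
Splitting the event stream into sequences can be sketched as follows, assuming that each input line has the form `time event` from the scheme above and that the lines are sorted by time. The function names and the sample lines are hypothetical.

```python
from datetime import datetime, timedelta


def parse_line(line):
    """Split 'YYYY-MM-dd HH:mm:ss.SSS word' into (timestamp, word)."""
    date, clock, word = line.split(" ", 2)
    return datetime.strptime(f"{date} {clock}", "%Y-%m-%d %H:%M:%S.%f"), word


def split_into_sequences(events, max_gap=timedelta(seconds=10)):
    """Group time-ordered (timestamp, word) pairs into sequences.

    A new sequence starts whenever the gap to the previous event
    is greater than max_gap.
    """
    sequences = []
    previous_time = None
    for time, word in events:
        if previous_time is None or time - previous_time > max_gap:
            sequences.append([])
        sequences[-1].append(word)
        previous_time = time
    return sequences


# Hypothetical log lines, already sorted by time.
lines = [
    "2016-05-01 12:00:00.000 connection://app/443/tcp/ESTABLISHED",
    "2016-05-01 12:00:04.500 connection://app/443/tcp/CLOSED",
    "2016-05-01 12:00:14.500 connection://app/80/tcp/ESTABLISHED",   # gap of exactly 10 s
    "2016-05-01 12:00:30.000 connection://app/80/tcp/CLOSED",        # gap of 15.5 s
]
sequences = split_into_sequences(parse_line(line) for line in lines)
```

Each resulting sequence plays the role of a sentence, from which n-grams can be generated in the same way as from text.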

4.3 Attack Samples

With the help of an expert, two types of attacks were simulated: a) Attack 1, obtaining
sensitive data (cookies, stored passwords, auto-fill data), and b) Attack 2, gaining
remote access to the shell of the compromised device. We had 97 attack samples of type 1
(59) and type 2 (38) at our disposal, which we extracted from the logs according to the
time intervals between the start and the end of the attack simulations. The samples
contained around 400 words (events) on average.

5 Experiment

On the recommendation of the expert who carried out the attack simulations, we chose
attacks of type 2, and for each sample of an attack of this type we generated n-gram
models of order 1 to 6 with Kneser-Ney probability smoothing. For each of the 38 attacks
of type 2 we thus had 6 models, which we then evaluated on two types of datasets:

CLEAN: datasets of events without suspicious activity. We created one dataset from the
logs of device A over a period of four days. In total it contained 47 132 events in 901
sequences. We used this dataset as the test dataset with safe events.

ATTACK: datasets of events with an attack of type 2. For each of the 38 samples of a
type 2 attack we created one dataset. Each dataset consisted of sequences of events in
which two consecutive events were no more than 10 seconds apart. Of these 38 datasets,
19 came from the logs of device A. The others came from device B (15 attacks) and device
C (4 attacks).

For the experiment we selected the 19 datasets of device A, from which we generated the
same number of n-gram models for each order (1 to 6). We refer to these models below as
attack models. The remaining datasets, together with these 19 selected ones, were used
as test datasets with suspicious events. Since we created the CLEAN and ATTACK test
datasets from a device on which attacks had been simulated, we verified that no
suspicious events had entered the CLEAN dataset.

We evaluated the attack models in two steps. In the first step we evaluated the attack
models on all ATTACK datasets except those from which the models had been built
(independence from the training dataset). We thus performed a total of 342 evaluations
for each n-gram order (1 to 6) and observed the perplexity and the log-probability that
the chains of the dataset belong to the modelled attacks. In the second step we
evaluated the attack models on the CLEAN dataset. As in the first step, we observed the
perplexity and log-probability metrics. The values measured in both steps are plotted in
Fig. 1. As the figure shows, the log-probability values obtained by evaluating the
attack models on the ATTACK datasets (step 1, red) were higher than those obtained on
the CLEAN dataset (step 2, blue; the numbers label the attack models by attack sample).
For the perplexity metric the picture was not as clear-cut. The outcome, however, was
that log-probability could be used to distinguish suspicious samples of event logs from
ordinary ones.
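
The two metrics can be sketched as follows. The sketch assumes base-10 logarithms and a model given as a function `prob(prev, word)`; for brevity the toy attack model is of order 1 and ignores the previous event. The event names and probabilities are made up and do not come from the experiment.

```python
import math


def log_probability(sequences, prob):
    """Total log10-probability of the sequences and the number of scored events."""
    total, count = 0.0, 0
    for sequence in sequences:
        tokens = ["<s>"] + list(sequence) + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            total += math.log10(prob(prev, word))
            count += 1
    return total, count


def perplexity(sequences, prob):
    """Perplexity: inverse probability normalised by the number of events."""
    total, count = log_probability(sequences, prob)
    return 10 ** (-total / count)


# Hypothetical order-1 attack model over three event words and the end marker.
attack_model = {"shell": 0.5, "net": 0.3, "ui": 0.1, "</s>": 0.1}


def attack_prob(prev, word):
    return attack_model[word]   # order 1: the previous event is ignored


attack_like = [["shell", "net", "shell"]]   # resembles the modelled attack
clean_like = [["ui", "ui", "ui"]]           # ordinary activity

attack_score, _ = log_probability(attack_like, attack_prob)
clean_score, _ = log_probability(clean_like, attack_prob)
```

A higher (less negative) log-probability under an attack model means that the sample resembles the modelled attack more closely, which is the property used above to separate suspicious samples from ordinary ones.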

Fig. 1. Results of evaluating the attack models on the CLEAN and ATTACK datasets


6 Conclusion

The initial results of our experiments show that methods of natural language modelling
could also be used in the field of mobile device security. In the experiment, by
evaluating models of malicious events, we obtained markedly higher log-probability
values for sequences of events with malicious activity than for those without it. This
observation could be used, for example, to train a binary classifier of event sequences
that would be able to detect dangerous activity on a mobile device with a certain degree
of probability.

Acknowledgment: This publication was supported by the projects VEGA 2/0167/16 and
EGI-Engage EU H2020-654142. We would also like to thank all colleagues, partners and
domain experts who collaborated and discussed with us.

Annotation:

Detection of malicious activity in the event logs of mobile devices

In this paper, we present an experiment conducted within a mobile security project. We
tried to apply methods of natural language modelling in the domain of mobile device
security. The aim was to investigate whether n-gram models created from event logs of
mobile devices can be used to detect malicious events or sequences of such events.

Abstract. Data mining and machine learning methods can be applied in many domains.
However, several of these domains generate only a limited volume of data, or obtaining a
larger volume of data is expensive, time-consuming or technically demanding. Even in
these domains there is a need for modelling and prediction, while the lower volume of
data makes it hard to generalize properties, which results in lower model accuracy. This
paper deals with a data transformation that turns a classical regression task into a
task with a markedly higher number of records in order to increase modelling accuracy.
The paper presents a modification of the data transformation and tests it on real
datasets. In doing so, it compares and evaluates the performance achieved by the trained
models.


Typ prispevku: Vyskumny prispevok 

Kl’iicove slova: datova transformacia, regresia, modelovanie, dolovanie 


1 Introduction

Big data is clearly a current trend in many areas of computer science. It poses
challenges in the efficient handling of resources, scalability, the choice of a suitable
distributed architecture, and so on. In data mining, however, we often face the opposite
situation, where we cannot speak of big data at all: only a few thousand records, or even
fewer, are available. Such cases arise in domains where measuring and collecting data is
time-consuming, technically demanding or economically costly. The problem then becomes
the representativeness of the data set and the ability of the model to generalize
relationships, which shows up as reduced model accuracy. Despite these problems, there is
still a need to model and predict quantities in these domains. At present, ensemble
learning methods [1] are usually used to improve model accuracy. These methods often
combine several different types of models, which compensate for each other's weaknesses.
The resulting ensemble model, composed of partial models, usually achieves higher
prediction accuracy because its predictions are the result of voting among the partial
models. The second principle often used in ensemble learning (see footnote 1) is repeated
training of a single type of model, where the weights of individual records change
depending on the success of the predictions. This approach is also used by the well-known
AdaBoost method. Several ensemble learning methods (boosting, bagging) were originally
designed for classification, but modifications for regression were proposed later [2],
[6]. Overall, however, these methods rest on the principle of merging several models,
which leads to a complex structure of the resulting model. In our contribution we use
only a single model, trained on transformed data. Further advantages of this approach are
the possibility of applying ensemble learning on top of it (for further improvement) and
the freedom to choose the type of model used.

1.1 Basic principle of the transformation

The basic idea of the transformation technique is to transform the original regression
task into an equivalent one with a higher number of records and attributes, so that the
structure of the data is better suited to the machine learning process. For this purpose
a data transformation is used that successively selects all possible pairs of records
from the original data set, with each pair forming one record of the transformed data
set. From the original N records we obtain N^2 - N records in the transformed set. The
number of input attributes doubles, because each attribute is represented both by its
value and by its difference. The transformation is suitable for regression tasks with
exclusively continuous numerical attributes, especially for smaller data sets.
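
A minimal sketch of the pairing step described above follows. The layout of a transformed
record (values of the first record followed by the differences second minus first), the
sign convention and the function name are assumptions made for illustration; the exact
definition is given in [5]. Using the difference of the targets as the new target follows
the description in the text.

```python
from itertools import permutations

def pairwise_transform(X, y):
    """Build one transformed record per ordered pair (i, j) with i != j."""
    X_t, y_t = [], []
    for i, j in permutations(range(len(X)), 2):
        diffs = [b - a for a, b in zip(X[i], X[j])]
        # Assumed layout: attribute values of record i, then differences j - i.
        X_t.append(list(X[i]) + diffs)
        # The new target is the difference of the original targets.
        y_t.append(y[j] - y[i])
    return X_t, y_t
```

For N = 3 records with 2 attributes this yields 3^2 - 3 = 6 transformed records with
4 input attributes each, which matches the counts given above.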

We assume that the input data, which have a homogeneous tabular structure, have already
been preprocessed and contain the selected relevant input attributes. After the data
transformation is applied, the data are converted into a structure that contains the
differences of the original values in addition to the values themselves. The target
quantity becomes the difference of the original target quantities. For prediction, the
data transformation has to be applied again to the record being predicted, which is
paired with records from the training set. This yields a larger number of estimates of
the target quantity, from which the final predicted value is then determined. The data
transformation, its basic aspects, the strategies for determining the target value and
the prediction process are described in more detail in [5].

The principle that allows this technique to achieve an improvement has several aspects.
First, the use of differences expresses, to some extent, a distance measure in the
individual attributes. If two records have very similar values of the input attributes,
it is very likely that their target attributes will also have similar values (provided
that the input attributes are relevant). In natural systems with continuous quantities,
approaches that examine the effect of a change in the input on a change in the output
are used very often. Such difference-based approaches have also been used for modelling
within causal analysis [3], [4]. Tracking not only the values but also their changes
therefore makes the resulting model more accurate. This is also related to how values
are handled during training. Training models the dependence between the input (or
inputs) and the target attribute. Commonly used training procedures, however, usually do
not consider several records at once, let alone the differences between their values.
This is understandable, given that taking such differences into account would be
considerably time-consuming. An exception is the k-nearest neighbours model (which to a
large extent inspired this transformation): it does not train a model as such, but it
does take into account the differences between attribute values, from which it
ultimately computes the distances between records.


1 http://www.machine-learning.martinsewell.com/ensembles/ensemble-learning.pdf

Second, there is a statistical argument: from a larger number of independent estimates
we can obtain a more accurate prediction of the target attribute. A larger number of
estimates also allows various strategies for determining the final predicted value
(arithmetic mean, weighted mean, removal of extremes, selection of the nearest records,
or their combinations).
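
The strategies listed above can be sketched as follows. This is only an illustration
under assumptions: the helper names are hypothetical, "removal of extremes" is interpreted
here as a trimmed mean, and the sample estimates are made up.

```python
from statistics import mean

def weighted_mean(estimates, weights):
    """Weighted average of the estimates; weights must not sum to zero."""
    return sum(e * w for e, w in zip(estimates, weights)) / sum(weights)

def trimmed_mean(estimates, trim=1):
    """Drop the `trim` smallest and largest estimates, then average the rest.

    Assumes len(estimates) > 2 * trim so that at least one value remains.
    """
    kept = sorted(estimates)[trim:len(estimates) - trim]
    return mean(kept)

# Made-up estimates of one target value; 25.0 plays the role of an outlier.
estimates = [10.2, 9.8, 10.1, 25.0, 9.9]
plain = mean(estimates)            # pulled towards the outlier
robust = trimmed_mean(estimates)   # outlier and smallest value removed
```

In this example the plain mean is dragged upwards by the single outlying estimate, while
the trimmed mean stays close to the bulk of the estimates.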

Compared with the original version of the approach based on the data transformation [5],
several changes were made. In the prediction process, not all available records of the
training set were used (as in the original version), but only the K records with the
smallest Euclidean distance to the record being predicted. The value of K can thus be
tuned as needed to achieve better results; in our case K = 10 was chosen. The trained
regression model was applied to the selected records, which gave 2K estimates of the
target value. Another difference from the previous version is the use of weighting when
averaging the obtained estimates. The weights of the individual records were set to the
reciprocal of the distance. To avoid division by zero in the case of identical records,
the constant 0.01 was added to the distance. A further difference from the original
version was the normalization of the input attributes of the available data to the
interval from 0 to 1. The reason was to prevent the different ranges of the individual
attributes from influencing the distance between a pair of records.
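
The modified prediction step can be sketched as below. The values K = 10, the added
constant 0.01, the reciprocal-distance weights and the normalization to the interval 0
to 1 come from the text. Everything else is an assumption for illustration: `diff_model`
is a hypothetical callable standing for the trained regression model, a transformed
record is assumed to hold the values of the first record followed by the differences
(second minus first), and the 2K estimates are assumed to come from pairing the query
with each neighbour in both orders.

```python
import math

def min_max_normalize(X):
    """Scale every attribute (column) of X to the interval [0, 1]."""
    cols = list(zip(*X))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in X]

def predict(query, X_train, y_train, diff_model, k=10, eps=0.01):
    """Weighted average of 2K target estimates from the K nearest records."""
    dist = [math.dist(query, x) for x in X_train]
    nearest = sorted(range(len(X_train)), key=dist.__getitem__)[:k]
    num = den = 0.0
    for i in nearest:
        x = X_train[i]
        w = 1.0 / (dist[i] + eps)                 # reciprocal of shifted distance
        fwd = [q - a for q, a in zip(query, x)]   # differences query - neighbour
        bwd = [-d for d in fwd]                   # differences neighbour - query
        # Pair (neighbour, query): the model predicts y_query - y_neighbour.
        est_fwd = y_train[i] + diff_model(list(x) + fwd)
        # Pair (query, neighbour): the model predicts y_neighbour - y_query.
        est_bwd = y_train[i] - diff_model(list(query) + bwd)
        num += w * (est_fwd + est_bwd)
        den += 2 * w
    return num / den
```

Normalization is meant to be applied to all available records before the split, so that
the query and the training records share the same scale.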

For a more objective assessment of the suitability of the transformation and of the
performance of the models, validation was repeated 10 times. In each of the 10
repetitions, the training set consisted of records selected at random from those
available, while the test set contained the remaining records. The conditions for
comparing the results with and without the data transformation were identical (even the
same seeds were used for the random selection of records into the training set). Also
for the sake of greater objectivity, two types of regression models were used, a neural
network and the M5P model tree, as well as several data sets. In addition to prediction
accuracy, the experiments also tracked the time needed for prediction and the
possibility of determining an interval estimate of the target attribute for a particular
predicted record.





2 Results

The aim of this paper is to test the presented data transformation, with the modified
strategy for determining the final value, on real data sets. The data sets used were the
Combined Cycle Power Plant Data Set (see footnote 2; referred to as PowerPlant) and
Energy Efficiency (see footnote 3). They contain 9568 and 768 records with 4 and 8
numerical attributes, respectively. The same experiments were carried out for both data
sets. From each data set, 200 records were selected at random to form the training set.

In the first phase, two types of regression models (a neural network and a regression
tree) were trained on the original data. Each training was carried out 10 times with
different seed values, which gave a different initialization of the network as well as a
different selection of records in the training set. The models were validated on the
remaining records, and the average over the 10 repetitions was computed. The whole
process was repeated for smaller numbers of records: 195, 190, 185, ..., 80. In the
second phase, models were also built from the transformed data under the same conditions
(using the same seed values and the same selected records).

Tab. 1 and Tab. 2 show the average accuracy of the models, expressed by the correlation
coefficient (CC) and the mean squared error (MSE). The averages are always computed from
the 10 repetitions. The first model was a multilayer perceptron neural network with 2
hidden layers and a sigmoid activation function. The learning rate was set to 0.3 and
the maximum number of epochs to 500. The second model was the M5P regression tree [7],
with a minimum of 4 records per leaf and with pruning.
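
The evaluation protocol (10 seeded random splits, averaged correlation coefficient and
mean squared error) can be sketched as follows. The experiments themselves used Weka;
this pure-Python version is only an illustration, and `fit` is a hypothetical callable
that takes training inputs and targets and returns a prediction function.

```python
import math
import random
from statistics import mean

def correlation(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    ma, mp = mean(actual), mean(predicted)
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    va = sum((a - ma) ** 2 for a in actual)
    vp = sum((p - mp) ** 2 for p in predicted)
    return cov / math.sqrt(va * vp)

def mse(actual, predicted):
    """Mean squared error."""
    return mean((a - p) ** 2 for a, p in zip(actual, predicted))

def repeated_holdout(X, y, fit, n_train, repeats=10):
    """Average CC and MSE over `repeats` random train/test splits.

    The seed of each repetition fixes the split, so two methods evaluated
    with the same seeds see exactly the same training records.
    """
    scores = []
    for seed in range(repeats):
        idx = list(range(len(X)))
        random.Random(seed).shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        model = fit([X[i] for i in train], [y[i] for i in train])
        predicted = [model(X[i]) for i in test]
        actual = [y[i] for i in test]
        scores.append((correlation(actual, predicted), mse(actual, predicted)))
    return mean(s[0] for s in scores), mean(s[1] for s in scores)
```

Running `repeated_holdout` once with a plain model and once with a model wrapped in the
transformation, using the same `repeats`, reproduces the "identical conditions" setup
described earlier.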


Tab. 1. Comparison of the performance of regression models trained with and without the
data transformation on the PowerPlant data set (CC = correlation coefficient, MSE = mean
squared error).

                            Neural network               M5P model tree
                            140 records   180 records    140 records   180 records
With transformation   CC    0.9595        0.9646         0.9652        0.9629
                      MSE   4.8021        4.5278         4.4963        4.6251
Without transformation CC   0.9656        0.9663         0.9636        0.9641
                      MSE   5.1683        5.0259         4.5938        4.5534


2 Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

3 Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Energy+efficiency

Tab. 2. Comparison of the performance of regression models trained with and without the
data transformation on the Energy Efficiency data set (CC = correlation coefficient,
MSE = mean squared error).

                            Neural network               M5P model tree
                            140 records   180 records    140 records   180 records
With transformation   CC    0.9977        0.9993         0.9982        0.9993
                      MSE   0.8642        0.6981         0.8308        0.7035
Without transformation CC   0.9776        0.9862         0.9804        0.9856
                      MSE   2.2451        1.9694         2.1177        1.8697


The training sets consisted of 140 or 180 randomly selected records; the test sets
contained the remaining unused records. The models were multilayer perceptron neural
networks and M5P regression trees, trained with the Weka library. The correlation
coefficient (CC) and the mean squared error (MSE) were computed during validation. The
values given in the tables are averages over 10 validations.

The results in Tab. 2 show a marked increase in the accuracy of the trained models in
both criteria when the transformation is used. The table lists only the cases with 140
and 180 records, but the other tested cases with 80 to 200 records give very similar
results.

When the transformation was tested on the PowerPlant data set, with results given in
Tab. 1, the improvement was not as pronounced. This is especially apparent for the M5P
regression trees. With neural networks, the accuracy of the model increases with the
number of records in the training set. Overall, however, the models trained with the
data transformation achieve better performance on average; only in a few individual
cases was their performance slightly worse.

In terms of time, training and prediction with the presented data transformation are
considerably more demanding. This is to be expected, given the need to apply the model
and the data transformation to the predicted data many times. Prediction with the
transformation is roughly a hundred times slower, depending on the strategy for
determining the final predicted value and on the number of estimates. The presented
technique is therefore suitable when high model accuracy is the primary criterion and
higher time requirements are not an obstacle.

3 Conclusion

Overall, the presented transformation technique shows potential for increasing the
accuracy of regression models. This was demonstrated on synthetic [5] as well as real
data, and the improvement in model accuracy was apparent in both criteria, the
correlation coefficient and the mean squared error. In terms of time, the technique
considerably increases the computational cost (especially in the prediction phase), an
effect that was expected when the technique was designed. Its use should therefore be
weighed against the requirements on the accuracy and speed of prediction and against the
number of records in the training set. On the whole, the presented technique has several
positive aspects, including the freedom to choose the type of model, the possibility of
an interval estimate of the target value, the choice of the strategy for computing the
target value, and a marked improvement in model accuracy. Clearly, not every real data
set will see such a marked increase in accuracy. The technique therefore still needs to
be tested in more detail on further data sets. It would also be interesting to compare
the accuracy of models using the presented technique with boosting.

Acknowledgement: This publication was made possible by the support of the VEGA 2/0167/16
project.

Annotation:

Using Transformation Regression Technique for Data Mining

Data mining and machine learning methods can be used in many domains. However, several
domains generate only a limited volume of data, because obtaining larger data sets is
difficult in terms of time, cost or technical effort. These domains also require
modelling and prediction, so the small data volume can cause problems with
generalization and reduce model precision. The presented paper deals with a data
transformation which replaces the original regression task

Abstract. After a brief introduction to family business, we describe which data are
publicly available and outline the possibilities of machine learning methods for
analysing data related to family business. We also present the first results of a pilot
study of 67 family businesses.

Contribution type: Work-in-progress paper

Key words: family business, machine learning, classification


1 Introduction

Family business is an interesting alternative to mass, anonymized production based on
uncreative work, and in a number of societies close to ours it is supported by the
state. The situation is similar in the Czech Republic. However, not only the development
of family business in our country but also the legislative process lags somewhat behind.
According to Karel Havlicek, chairman of the board of the Association of Small and
Medium-Sized Enterprises and Crafts of the Czech Republic (AMSP), there are 260 000
small and medium-sized enterprises in the Czech Republic, of which about 70% may be
family businesses, based on an AMSP estimate from a sample (see the conference
Budoucnost rodinnych firem v CR - inspirace i vyzvy, 24 February 2016, held on the
occasion of the launch of the web portal on family firms majitelefirem.cz). According to
Jiri Hnilica (VSE, ibid.), no single definition exists. The individual definitions,
however, usually regard the family's ownership share as essential, and above all

1. there is an intention to pass the firm on within the family,

2. more than one generation influences its direction, and

3. the family not only owns the firm but also takes part in its management.

For more on this topic in the Czech context see [4].

Every country, especially those to the west of Smolenice, but also, for example,
Slovakia, has a specification of a family business anchored in law. Point 1 can hardly
be verified other than by a direct question. For this contribution, we consider a family
business to be one that satisfies point 3 and a weaker variant of point 2: at least two
members of the family are in the management of the firm. In addition, we require that
the family's capital contribution be decisive.

In the TACR project Rodinny podnik - reseni socialnich a ekonomickych disparit obci
(Family business - a solution to social and economic disparities of municipalities), of
which this work is a part, we try to use data analysis methods to help describe the
existing state and to clarify some concepts (or procedures) connected with family
business. One of the goals may be to find out how well, for various definitions of a
family business, we are able to recognize such a business automatically, and how
suitable the existing data are for further analysis, e.g. clustering, prediction or
anomaly detection, by machine learning methods [2].

We assume that describing the state of family business in our country and determining
the vitality (health) of a family business are two areas of the project where machine
learning can be useful. This text also serves as our first reflection on such a use.

2 Information about family business

2.1 Search engines and portals

The first question was whether family businesses can be found with a web search engine.
We tried three simple queries (using Maxthon, a free multi-platform web browser
developed by Maxthon International and the sixth most used worldwide) and, for the first
20 links, recorded the precision (i.e. how many of the retrieved links satisfy our
definition):

a syn firma ("and son" firm)                              16/20   80%

rodinna firma s tradici (family firm with tradition)      13/20   65%

tradicni rodinna firma (traditional family firm)          10/20   50%

Although this experiment was not extensive, we can see that a self-declared family
business need not be a family business.

The pages found in this way were, however, created for a different purpose and therefore
contain almost no important economic information. Data from the portals www.detail.cz,
or www.Merk.cz and http://wwwinfo.mfcr.cz/ares/, are better suited for that. The first
of them gives a very good overview of each firm. This paid service provides a summary of
everything that can be found about a firm on the Internet. For our needs the third one,
ARES (Administrativni registr ekonomickych subjektu, the Administrative Register of
Economic Subjects), seems sufficient, even though some data are missing from it. Besides
the address and the main area of business, it contains, among other things, data from
the public part of the Trade Register (Zivnostensky rejstrik), including history, and
information on the reliability of the VAT payer.

2.2 Data from a questionnaire survey

Another way to obtain more accurate data is through various types of questionnaire
surveys. The first pilot questionnaire survey, recently carried out by the Technical
University of Liberec [3] (for more see http://vyzkum.ef.tul.cz/td03000035/), contains
data on 67 family businesses. Because the following analysis concerns these data, we
list the individual items that make up the record of a family business, together with
their types. These are: Company name, ICO (company identification number), Type of
business (SRO, AS, FO, VOS), Date when the company came into existence, Age of the
company, Number of employees, Business area code, Whether the firm drew earmarked state
subsidies, Volume of subsidies drawn, Firm started as a sole trader, Insolvency, Age of
the founder, Capital of the firm, Is a VAT payer, Region (LB, USTI, PR, ZLIN, CB) and
Contract with a municipality.

3 Experiments with the questionnaire survey data

Because data on non-family businesses were not available to us at the time this text was
submitted, we restrict ourselves here to recognizing some characteristics of family
businesses. We always used all the attributes described above except ICO. As the quality
criterion we chose overall accuracy for classification (the relative number of correctly
classified instances from the test set) and, for regression tasks, correlation and RRSE
(root relative squared error, in %). The baseline value corresponds to the given
criterion under random classification. For the analyses we used the Weka tool [1] and
all methods with default settings, with 10-fold cross-validation. We report results only
for those attributes where the difference from the baseline was marked (more than 5% for
classification tasks). In each case the learning method with the best accuracy and the
baseline are given.
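
For readers unfamiliar with the criteria above, the following minimal sketch shows how
they can be computed. It is not the Weka implementation. The majority-class baseline is
an assumption made here for illustration; the text only states that the baseline
corresponds to random classification.

```python
import math
from collections import Counter

def accuracy(actual, predicted):
    """Relative number of correctly classified instances."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def majority_baseline(actual):
    """Accuracy of always predicting the most frequent class (assumed baseline)."""
    return Counter(actual).most_common(1)[0][1] / len(actual)

def rrse(actual, predicted):
    """Root relative squared error in %, relative to predicting the mean."""
    m = sum(actual) / len(actual)
    num = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    den = sum((a - m) ** 2 for a in actual)
    return 100.0 * math.sqrt(num / den)
```

An RRSE of 100% means the model is no better than always predicting the mean of the
target, so values close to 100% indicate a weak dependence.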

Is a VAT payer. For a decision tree (the J48 method) containing only three attributes
(Capital, Employee, Birth of pioneer), the overall accuracy reached 88.1% against a
baseline of 71.7%. This is a relatively strong dependence.

Firm started as a sole trader. The best result was achieved with Random Forest, where
accuracy exceeded 67.2% against a baseline of 56.7%. The dependence of this attribute on
the remaining ones is therefore weak.

CEDR - was a recipient of subsidies. After the attribute with the volume of subsidies
was removed, accuracy was 79.1% (J48, baseline 58.2%), and whether subsidies were
obtained depended only on the number of employees: the larger the firm, the higher the
probability of obtaining a subsidy.

Age of the founder. For this regression task we used Random Forest. The correlation
coefficient exceeded 0.57 (baseline -0.35), with an RRSE of 89.9%.

Number of employees. The second regression task was to find out whether the number of
employees depends on the other attributes. Here the correlation coefficient reached 0.45
and the RRSE 93.6%. A dependence therefore exists, but it is not strong.

4 Conclusion

In this short introductory study we focused on family business from the data processing
point of view and presented the first results of processing a pilot study of family
businesses.

At present (October 2016), the collection of data describing the state of family
business in small settlements across the whole territory of the Czech Republic has been
completed. On their basis we are assembling a data set of negative examples, i.e.
non-family businesses that resemble family businesses as closely as possible, or that
even declare themselves to be family businesses while not being so.

Acknowledgement. We thank Pavla Suchankova and Katerina Bekova for the initial
experiments. This work was partially supported by the TACR Omega grant TD03000035
Rodinny podnik - reseni socialnich a ekonomickych disparit obci and by the Faculty of
Informatics, Masaryk University, Brno.

Annotation:

Machine learning for family business analysis

We give a brief overview of family business, mention difficulties in formulating a
definition of a family enterprise and then describe what data are currently available
for data analysis. We show how machine learning methods can be used to analyse 67 family
enterprises. In conclusion we mention future work.

Abstract. Aircraft engine failures can be expensive and pose an obvious safety threat.
If we are able to predict a potential engine failure in advance, we can send the
aircraft for maintenance. Sensor data are collected during engine starts, takeoffs,
cruise and special events. The aim of this research is to create a model of the standard
behavior of so-called healthy engines and, based on it, to detect a serious change that
can predict a failure. Furthermore, we want to distinguish among particular failure
types. The model should not only pass data tests successfully but should also have a
physical explanation. Sometimes the resulting model shows strong dependences on
attributes that should be auxiliary at most, or it shows physically improbable relations
among attributes. We present the first results obtained with a One-class Support Vector
Machine, which show a significant increase in the anomaly factor of two out of four
faulted engines as they approached failure. We also made experiments with group anomaly
detection.

Contribution type: Work-in-progress paper 

Key words: fault prediction, aircraft engine, support vector machine, group 
anomaly detection 


1 Introduction 

Each flight phase produces different sensor readings, which are captured as one record
per event. A record can be a single snapshot taken at the time the event occurred, such
as a takeoff, or a snapshot of sensor readings captured over a period of time, such as
an engine start-up.

In this paper we focus on engine starts. During a start, an engine goes through a number
of phases in which various components become dominant. These components are measured as
fuel flow, speed of the high-pressure compressor, times between phases, and various
temperatures and pressures. The start phases are related to different types of failure.

Our data contain over a million records from over a thousand engines. For every type of
fault we have a list of engines built from maintenance reports. The predictive model
should recognize an increasing probability of a fault with no false negatives and a
reasonable number of false positives.

Common predictive algorithms do not show any difference between the classes, and anomaly
detection applied to non-transformed data shows roughly the same ratio of anomalies for
healthy and faulted engines. Our research is challenging given that most predictive
algorithms for aircraft engines are based on just a few attributes with decision
boundaries defined by experts.

2 Feature engineering 

2.1 Domain knowledge 

Domain knowledge is necessary in the phases of attribute selection and feature
extraction. Engine experts identified a list of attributes which may be related to a
certain fault, so we have a subset of reasonable attributes to begin with. When we
encounter a false positive engine, we can ask whether the engine had some unusual
maintenance or other condition which would eliminate it from the testing set of healthy
engines.

2.2 Data transformation 

Feature engineering aims to describe the inherent structures that give the best
representation of the data.

Most of the attributes are affected by a seasonal effect, which is easily visible on
plots as a periodic wave. We therefore use polynomial regression with ambient
temperature as the independent variable to eliminate it.
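
The temperature correction can be sketched as fitting a polynomial in ambient
temperature and keeping only the residuals. The degree of the polynomial is not stated
in the text, so the degree 2 used below is an assumption, as are the function names; the
least-squares fit is written out with the standard library only.

```python
def polyfit(x, y, degree=2):
    """Least-squares coefficients c[0] + c[1]*x + ... via the normal equations."""
    n = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y].
    m = [[sum(xi ** (r + c) for xi in x) for c in range(n)]
         + [sum(yi * xi ** r for xi, yi in zip(x, y))] for r in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def remove_temperature_effect(temperature, values, degree=2):
    """Return residuals after subtracting the fitted temperature dependence."""
    coef = polyfit(temperature, values, degree)
    fitted = [sum(c * t ** k for k, c in enumerate(coef)) for t in temperature]
    return [v - f for v, f in zip(values, fitted)]
```

The residuals then carry whatever the ambient temperature does not explain, which is the
part of the signal that later anomaly detection should look at.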

Values usually oscillate, so we use moving averages to capture trends. We also derived
differences between successive records and moving variances. Principal components were
added to represent linear interactions among features.
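
The derived features can be illustrated with a short sketch. The window length, the use
of trailing windows and the sample readings are assumptions made for illustration; the
readings are made up and do not come from the engine data.

```python
from statistics import mean, pvariance

def rolling(values, window, func):
    """Apply func to each trailing window of `window` consecutive values."""
    return [func(values[i - window + 1:i + 1])
            for i in range(window - 1, len(values))]

def successive_differences(values):
    """Difference between each record and the one before it."""
    return [b - a for a, b in zip(values, values[1:])]

# Made-up sensor readings from consecutive engine starts.
readings = [610.0, 612.0, 611.0, 615.0, 640.0, 642.0]
moving_avg = rolling(readings, 3, mean)
moving_var = rolling(readings, 3, pvariance)
diffs = successive_differences(readings)
```

In this toy series the jump between the fourth and fifth reading shows up as a large
successive difference and as a spike in the moving variance, while the moving average
rises more gradually.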

3 One-class support vector machine for anomaly detection

A support vector machine maps data through a non-linear function to a higher-dimensional
space, in which it then separates the data into classes. Another possibility is to use a
kernel function to create non-linear boundaries without an explicit projection to a new
feature space.

One-class SVM performs unsupervised learning. The training samples define a function
that takes the value +1 in a small region capturing most of the data points and -1
elsewhere [1]. The class of training samples is separated by a hyperplane with maximal
distance from the origin. Normal data provided to the SVM create a representational
model. New data presented to the model are then assessed by the probability of being
inside the model.

An aggregation of anomaly points can represent each engine, and we can simply sort
engines according to their ratio of such points. The problem with this approach is that
it is hard to decide how many records should be included in the group representing a
particular engine. In other words, the question is how many days or records before the
fault we should see signs of it. Another issue is that a ratio which is high but
decreasing before the fault does not indicate a coming fault.

We tried to create the model from healthy engines, which define normal behavior, and to
flag any significant change from it. The other option is to create the model from
records of engines with a particular failure, and thus to separate a small class of
records. The second approach generates almost no false positives, but when we remove one
faulty engine from the training set and then use that engine in testing, most of the
models recognize it as healthy. The model thus overfits the training data.
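
The per-engine aggregation discussed in this section can be sketched as follows. The
labels use the +1 / -1 convention of the one-class SVM; the window length, the
half-window trend check, the function names and the engine data are assumptions made for
illustration only.

```python
def anomaly_ratio(labels, last_n):
    """Share of anomalies (label -1) among the last `last_n` records."""
    recent = labels[-last_n:]
    return sum(1 for v in recent if v == -1) / len(recent)

def is_increasing(labels, last_n):
    """Compare the anomaly ratio in the older and newer half of the window.

    Assumes last_n >= 2 so that both halves are non-empty.
    """
    recent = labels[-last_n:]
    half = len(recent) // 2
    older = anomaly_ratio(recent[:half], half)
    newer = anomaly_ratio(recent[half:], len(recent) - half)
    return newer > older

# Made-up label sequences per engine, oldest record first.
engines = {
    "A": [1, 1, 1, -1, -1, 1, -1, -1],
    "B": [-1, -1, 1, -1, 1, 1, 1, 1],
    "C": [1, 1, 1, 1, 1, 1, -1, 1],
}
ranking = sorted(engines, key=lambda e: anomaly_ratio(engines[e], 8), reverse=True)
```

Engine B illustrates the second issue raised above: its ratio is fairly high, but it is
decreasing, so it would not be flagged by the trend check.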

4 Group anomaly detection 

Common anomaly detection looks at every record individually, but there can be records
which are not anomalies themselves, yet whose distribution as a group differs from that
of other groups. This approach has been used successfully for astronomical data and for
data from high-energy particle physics experiments [2].

For our experiments we chose the Mixture of Gaussian Mixture Models (MGMM), which works
with multimodal distributions. MGMM assumes that the feature vectors in groups are
generated by a mixture of K Gaussian distributions [3].

5 Results 

A successful model should be able to recognize an increasing ratio of anomalies for
faulted engines. We found this behavior in two out of four faulted engines with fuel
system failures. The results of the one-class SVM are depicted in Figure 1, where the
anomaly score is the anomaly level obtained from the algorithm when the required
percentage of anomaly points is set to 1%. Healthy engines have anomalies too, but none
with such an increase. Anomalies in healthy engines, and in faulted engines outside the
time frame before the fault, are intermittent; in other words, they do not form long
runs of consecutive anomalies. We are now investigating other types of start-up
failures. Interestingly, when we took all types of failures together to see whether we
could find anomalies indicating any type of problem, we got many false positives and
false negatives (depending on the algorithm settings), and we therefore conclude that
they do not share a common model.

Group anomaly detection with MGMM did not show any difference between healthy and
faulted engines, even with various parameter settings, combinations of features and
numbers of records per engine.

6 Discussion 

Our research leads us to time series, which seem a logical and appropriate solution.
However, we have to deal with the irregularity of data capture. We can have many records
in one day, or a few weeks without any record. Records from the same day can be highly
dependent on each other. Missing data can be caused by engine inactivity or simply by
the fact that we did not receive the data. These conditions should be taken into account
in the predictive algorithm.

Fig. 1. Results from the one-class SVM show an increasing anomaly level of the second
and third engine before the fault occurred. The red vertical line indicates the time of
the fault, and the black horizontal line separates normal records (below the line) from
anomalies (above the line).


References 

1. Schölkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., Platt, J.: Support Vector
Method for Novelty Detection. MIT Press. 1999, 2000(12), 582-588.

2. Muandet, K., Schölkopf, B.: One-Class Support Measure Machines for Group Anomaly
Detection. Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial
Intelligence. 2013, 2013(6).

3. Xiong, L., Póczos, B., Schneider, J., Connolly, A., Vanderplas, J.: Hierarchical
probabilistic models for group anomaly detection. JMLR W&CP, Proceedings of the
International Conference on Artificial Intelligence and Statistics (AISTATS). 2011,
2011(15), 789-797.


















System na podporu rozhodovania pomocou 
jednoducheho a efektivneho pochopenie 
medicinskych zaznamov 


Michal Vadovsky, Frantisek Babic, Miroslava Muchova 

Katedra kybemetiky a umelej inteligencie 
Fakulta elektrotechniky a informatiky 
Technicka univerzita v Kosiciach 
Letna 9/B, 042 00, Kosice, Slovenska republika 

{michal.vadovsky,frantisek.babic,miroslava.muchoval@tuke.sk 


Abstract. Medical records are an important source of information about patients' health, but they are often stored in a form that does not allow effective management or, above all, use for medical diagnosis. The aim of our work was to show the potential of selected methods of exploratory analysis and data mining for a simple and effective understanding of medical records. For this purpose we used a data sample from Croatia in which individual patients are characterised by a wide range of parameters routinely collected and evaluated in general practitioners' offices. Based on these data we tried to identify key symptoms or threshold values for diagnosing Mild Cognitive Impairment. The results are presented in a form that is easy to understand even for a user with minimal knowledge of statistics or data analysis, namely a physician.

Contribution type: Research paper

Keywords: patient data, analysis, diagnostics


1 Introduction

Mild Cognitive Impairment (MCI) is an intermediate stage between the expected decline of cognitive functions in normal ageing and the more serious onset of dementia [1]. Problems with memory, speech and thinking set in faster than is usual in normal ageing. The occurrence of MCI can increase the risk of later developing dementia caused by Alzheimer's disease (AD) or other neurological disorders. On the other hand, the health of people with this disorder may also improve over time. The risk factors with the greatest influence on a positive MCI diagnosis include, for example, increasing age or a specific form of the gene known as APOE-e4, which is also associated with AD. The individual's lifestyle is taken into account as well, but the evidence for these risk factors is less clear (diabetes, smoking, depression, high blood pressure, elevated cholesterol, lack of physical activity). Early diagnosis makes it possible to start suitable treatment immediately, which can, for example, slow or prevent progressive memory loss [4]. The usual diagnostic procedure is a gradual collection of all the necessary input data, from which the physician then forms an overall picture of the patient's health and makes a decision. In most cases, however, this procedure is rather time-consuming and, above all, requires a constant overview and understanding of an ever-growing volume of data. This opens up room for the creation and deployment of a decision support system with which the physician can process and understand data on the patient's current health and also compare them with historical values.

The data sample contains information on 93 patients from the clinical practice of a cooperating expert in Croatia. The patients comprise 35 men and 58 women aged 50 to 89, with a ratio of positive to negative MCI diagnoses of 37 to 56. Each patient is also characterised by the values of 59 factors that represent potentially important inputs for MCI diagnosis, e.g. age, sex, hypertension, diabetes, cholesterol, cardiovascular diseases, allergies, white blood cell level, red blood cell level, etc. The target diagnosis in our case is a binary attribute: (0) a healthy person, (1) positively diagnosed MCI.

The article is divided into several main parts. The introduction presents the problem, the methods used and the data sample. The second part is devoted to the design of a system intended to give the physician important information to support decision making. The conclusion summarises the results achieved and outlines the authors' further steps in this area. It should also be mentioned that the authors have analysed this data set, or similar ones, with other data mining methods, e.g. selected algorithms for generating decision trees. The results of these experiments are presented in [2, 3]. In this article the main motivation was to verify the potential of selected statistical methods for presenting important findings extracted from the data in a form that is easy to understand for the end users, i.e. physicians. The R language offers several options for this purpose, which can result in an easy-to-use decision support system.

2 Design of the decision support system

The computational side of the prototype is based on the statistical methods mentioned above, implemented in the R language. The user interface is implemented with the RShiny application package. The validity of the results can be verified against the existing medical literature or on the basis of the practical experience of the cooperating expert.

2.1 Identification of key factors

To analyse the dependency between the target attribute and the input attributes we decided to use the two-sample Welch t-test for numerical attributes [6] and Pearson's chi-squared test of independence for nominal attributes [6]. In the first case we set the alternative hypothesis HA, which states that the means of the two populations (0/1) differ (the attributes are dependent). The corresponding null hypothesis H0 states that the population means are equal (the attributes are independent). We identified the strongest dependencies for the following numerical attributes: Age (p-value = 0.000017, significance level 0.01), Alpha-2 globulin (0.0458, 0.05), Clear (0.0535, 0.1) and Skinf (0.0953, 0.1). The attribute Clear is a good indicator of the filtration capacity of the kidneys; a low or reduced value indicates chronic kidney disease. The attribute Skinf is the thickness of the skin fold at the triceps. We used a similar procedure for the nominal attributes; the null hypothesis in this case states that there is no dependency between two nominal attributes. Within the set of 17 attributes we confirmed an interesting dependency at the 0.1 significance level, i.e. at 90% confidence, only for Analg (treatment with analgesics). Fig. 1 also supports this dependency.
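
The two tests named in this section can be sketched in a few lines. The following Python fragment is only an illustration under stated assumptions: the paper's own implementation is in R and is not shown, the function names `welch_t` and `chi2_2x2` are hypothetical, and the numbers are invented rather than taken from the patient data. It computes Welch's t statistic with the Welch-Satterthwaite degrees of freedom, and Pearson's chi-squared statistic for a 2x2 table (where, with one degree of freedom, the p-value has a closed form).

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def chi2_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2x2 contingency table.

    No continuity correction is applied. With 1 degree of freedom the
    upper tail probability equals erfc(sqrt(chi2 / 2)).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Invented example values, not the study data.
age_healthy = [52, 58, 61, 63, 66, 70]
age_mci = [68, 71, 74, 77, 80, 83]
print(welch_t(age_healthy, age_mci))

# Rows: Analg no/yes, columns: MCI 0/1 (invented counts).
print(chi2_2x2([[30, 12], [14, 20]]))
```

Turning the t statistic into a p-value needs the Student t distribution, which the Python standard library does not provide; a statistics package would normally be used for that step.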


2.2 Identification of threshold values

We analysed threshold values for the individual input attributes using the ROC curve and computed the cut-off point with the Youden method [5]. This method is commonly used to evaluate tests in biostatistics [7].

$$J = \max_{c}\,\{\mathrm{Sensitivity}(c) + \mathrm{Specificity}(c) - 1\} \qquad (1)$$

J is a function of recall (sensitivity) and specificity; at the optimal value of c it equals the maximum vertical distance between the ROC curve and the main diagonal.



Fig. 1. Visualisation of the dependency between the target MCI diagnosis (y, 0/1) and the input attribute Analg (x, yes/no)

Fig. 2 shows the plot of the distribution function, which indicates that patients with a positive MCI diagnosis are older than healthy people. The cut-off point in this case is an age of 70.5 years, at which a classification accuracy of 70.97% was achieved.
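
A cut-off point chosen by the Youden method can be found by scanning candidate thresholds and keeping the one with the largest J. The sketch below is a minimal illustration, not the authors' R code: the function name `youden_cutoff`, the choice of midpoints between observed values as candidates, and the rule "predict positive when the value is above the cut-off" are all assumptions.

```python
def youden_cutoff(values, labels):
    """Return (cutoff, J, accuracy) for the cut-off that maximises
    sensitivity + specificity - 1.

    A record is predicted positive when its value is above the cut-off.
    Both classes must be present and there must be at least two distinct values.
    """
    distinct = sorted(set(values))
    # Candidate cut-offs: midpoints between consecutive distinct values.
    candidates = [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
    positives = sum(labels)
    negatives = len(labels) - positives
    best = None
    for c in candidates:
        tp = sum(1 for v, y in zip(values, labels) if v > c and y == 1)
        tn = sum(1 for v, y in zip(values, labels) if v <= c and y == 0)
        j = tp / positives + tn / negatives - 1
        if best is None or j > best[1]:
            best = (c, j, (tp + tn) / len(labels))
    return best


# Invented ages and diagnoses (1 = MCI), not the study data.
ages = [55, 60, 65, 68, 72, 75, 80, 85]
mci = [0, 0, 0, 1, 0, 1, 1, 1]
print(youden_cutoff(ages, mci))
```

When several cut-offs share the same J, this sketch keeps the lowest one; other tie-breaking rules are equally defensible.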

Besides accuracy, the user also has other metrics for evaluating how well the cut-off point was determined, e.g. the "true positive" parameter, i.e. how many of the cases labelled positive were in fact classified correctly. A low value of this parameter means that several negative cases were labelled positive, generating so-called false alarms. In that case the cost of the necessary treatment has to be taken into account.



Fig. 2. Visualisation of the distribution function for the attribute Age (x) and the target MCI diagnosis (y,
0/1) with the identified cut-off point


3 Conclusion

The aim of the article was to outline the possibilities of exploratory analysis and data mining for a simple and effective understanding of medical records. These records are an important source of information that physicians use to diagnose diseases. It is often necessary to consider several symptoms that change over time and in different contexts. This opens up room for a comprehensive decision support system whose recommendations are available in a visually pleasant and easily understandable form. The examples presented above are only a sample; the same visualisations and descriptive characteristics can be generated for all input attributes. More details on the experiments and the proposed system can be found in an article submitted to the journal "BMC Medical Informatics and Decision Making".

Annotation:

Decision support system based on simple and effective understanding of the available medical
records.

Medical records are an important source of information about a patient's health status, but they are often kept in a form that does not allow their effective management or their use for medical diagnosis. The aim of this work was to demonstrate the potential of selected methods of exploratory data analysis and data mining for a simple and effective understanding of the available medical records. For this purpose, we used a data sample from Croatia in which individual patients are characterised by a wide range of parameters routinely collected and evaluated by the general practitioner. Based on these data, we tried to identify the key symptoms or relevant thresholds for MCI diagnosis. The system presents the extracted knowledge in a form that is easily understandable even for users with minimal knowledge of statistics or data analysis.

Abstract. Today, as the amount of information on the internet keeps growing, automatic processing and classification of data has become a very popular field of information technology. Online news is one such area. The goal of this project is a tool covering the whole process of basic analysis of articles from Czech news servers. The project focuses mainly on extracting relevant data and analysing them. Its first part, however, also includes a crawler with which articles can be downloaded from news websites for analysis. In the second part, the relevant content of the articles and their other attributes are automatically extracted from the downloaded HTML pages. The third part is a text analysis built on existing methods and tools, which focuses on named entity extraction and sentiment analysis of Czech text. The resulting structured data can be queried from various points of view, enabling various kinds of experiments.

Contribution type: Application paper

Keywords: news servers, text mining, named entities, sentiment


1 Introduction

In the Czech environment there are many news servers, each differing slightly in content. The information they generate can be hard to analyse for several reasons. Articles are located on different web servers and can only be searched through whatever search engine is available. The important parts of an article, mainly its content, are surrounded by other unwanted elements such as the site template or advertisements. It is therefore not easy to obtain only the relevant information. In addition, an article is written in natural language and usually comes with no structured information.

The goal of this project is to create a tool [1] that makes it possible to obtain only the relevant information from Czech news articles and transform it into a structured form. This form enables subsequent analysis that explores the capabilities of current methods for text analysis of Czech.

2 Analysis of news articles

The process of acquiring and analysing data is divided into several steps. It is preceded by downloading the individual articles as HTML pages from various sections of several leading Czech news servers (Novinky.cz, iDnes.cz, Aktualne.cz and ParlamentniListy.cz), which can be done with any tool. For this project we created a crawler that downloads articles from various servers according to selected sections and time periods.

2.1 Extraction of relevant information

From the HTML we extract only the article title, publication date, content and keywords, which are the elements that most Czech news servers provide with an article. The extraction method uses several options. Where available, existing explicit annotations (microdata) can be used. For the other cases we propose the following heuristic for extracting the relevant data.

To extract the title we use the fact that the article title is always placed in an h1 tag. Because a page may contain several of them, we choose the best match, by Levenshtein distance, with the title tag, which contains the article title as a part of its text.
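
The title heuristic can be illustrated with a short, self-contained sketch. It assumes the texts of all h1 elements and of the title element have already been pulled out of the page; the function names `levenshtein` and `pick_headline` and the sample strings are hypothetical, and the authors' tool may normalise the strings differently.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute or match
            ))
        previous = current
    return previous[-1]


def pick_headline(h1_texts, title_text):
    """Choose the h1 text closest to the text of the page's title element."""
    return min(h1_texts, key=lambda h1: levenshtein(h1, title_text))


# Invented page fragments for illustration.
h1_candidates = ["Sport", "Volby na Slovensku"]
page_title = "Volby na Slovensku - iDnes.cz"
print(pick_headline(h1_candidates, page_title))
```

Because the title element usually carries extra text such as the site name, the distance to the right h1 is rarely zero; only the ranking of the candidates matters.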

The publication date is found in the page with regular expressions, taking into account the position of the date found relative to the positions of the title and the content. Time information too far from the main sections is discarded.

The algorithm for detecting the actual content builds on the principles of existing tools [6] and algorithms [2], which it adapts to the domain of news servers [1] and, in part, to the Czech environment, with the possibility of generalisation. It works by finding the text nodes in the page's Document Object Model (DOM) and removing those that, based on experimentally verified parameters, it estimates to contain irrelevant content. In the first step, all elements located above the article title are removed. Then all text nodes are selected from the DOM and a link density is computed for each. Nodes above a set percentage (e.g. 70% of the text is in links) are marked for deletion. Next comes an iteration over all selected nodes, in which further nodes are marked for deletion depending on whether the neighbouring text blocks are marked for deletion. If a set number of consecutive nodes is marked for deletion, the end of the page is detected and the remaining nodes are marked for deletion. The marked nodes are then removed from the DOM. Finally, the whole page is cleaned of nodes that contain no text and of other irrelevant elements, such as social network buttons.
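
The marking steps of the content detection can be sketched on a simplified representation in which each text node is reduced to its text and the number of its characters that sit inside links. This is an illustration under assumptions, not the published algorithm: the function name `mark_blocks`, the parameter names, the default of three consecutive blocks, and in particular the exact neighbour rule (a block is marked when both of its neighbours are marked) are guesses, since the paper does not spell them out.

```python
def mark_blocks(blocks, max_density=0.7, end_run=3):
    """Return a list of booleans, True meaning "delete this text block".

    blocks: list of (text, linked_chars) pairs in document order.
    """
    # Step 1: link density filter.
    marked = []
    for text, linked_chars in blocks:
        density = linked_chars / len(text) if text else 1.0
        marked.append(density > max_density)

    # Step 2 (assumed rule): a block between two marked blocks is marked too.
    result = list(marked)
    for i in range(1, len(marked) - 1):
        if marked[i - 1] and marked[i + 1]:
            result[i] = True

    # Step 3: a run of marked blocks signals the end of the article.
    run = 0
    for i, flag in enumerate(result):
        run = run + 1 if flag else 0
        if run == end_run:
            for k in range(i + 1, len(result)):
                result[k] = True
            break
    return result


# Invented blocks: two paragraphs followed by link-heavy page furniture.
page = [
    ("Perex clanku o volbach.", 0),
    ("Dlouhy odstavec textu clanku.", 2),
    ("Sdilet Tweet", 12),
    ("Dalsi clanky", 12),
    ("Reklama", 7),
    ("Copyright 2016 redakce", 0),
]
print(mark_blocks(page))
```

Working on plain tuples keeps the sketch independent of any HTML parser; in a real pipeline the pairs would be derived from DOM text nodes.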

Keywords are usually found in the HTML in specific tags (e.g. ul lists) with specific attribute values (class, id...), such as "tag" or "keywords".


2.2 Named entity extraction

Entity extraction is based on the existing programs NameTag [4] and MorphoDiTa [3]. We use the extracted article content as input to NameTag, which detects named entities in the text and returns a list of entities in the form in which they were found in the text. These entities are then converted to their base form using a simple principle of repeated occurrences in texts: if one of the occurrences is already in the base form, it is used for the others as well. The occurrences of each entity found are then grouped. If a person appears among the entities both by first name and surname and by surname only, these different forms are also grouped into one entity. For each entity, a SPARQL query then checks whether it has a representation on cs.dbpedia.org, and if so, the entity is assigned its URI reference.
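
Grouping surname-only mentions with full-name mentions can be sketched as follows. The function name `group_person_mentions` is hypothetical, the input is assumed to be a list of person mentions already converted to base form, and the rule of merging only when exactly one full name ends with the surname is an assumption made to keep the example unambiguous.

```python
from collections import Counter


def group_person_mentions(mentions):
    """Count mentions per person, folding a bare surname into the full
    name when exactly one full name with that surname was seen."""
    full_names = {m for m in mentions if " " in m}
    counts = Counter()
    for mention in mentions:
        if " " not in mention:
            owners = [name for name in full_names if name.split()[-1] == mention]
            if len(owners) == 1:
                mention = owners[0]
        counts[mention] += 1
    return counts


# Invented mentions for illustration.
print(group_person_mentions(["Milos Zeman", "Zeman", "Andrej Babis", "Zeman", "Fico"]))
```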


2.3 Sentiment analysis

For sentiment analysis of the article content we again used MorphoDiTa [3] together with SubLex [5], a list of Czech emotionally coloured words. Sentiment is determined by a simple dictionary method: the words of the article text are converted to their lemmas, which are then compared with the emotionally coloured words. Each positive word found is assigned the value 1 and each negative word -1; the resulting sentiment is the sum over the emotionally coloured words found. Sentiment is determined both for the whole text and for each of its sentences.
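
The dictionary method amounts to summing +1 and -1 over the lemmas found in a polarity list. The sketch below shows that computation only; it does not call MorphoDiTa or load SubLex. The lemma table and the polarity list are tiny invented stand-ins, and the names `sentiment` and `sentence_sentiments` as well as the simple sentence splitter are assumptions.

```python
import re


def sentiment(text, polarity, lemma_of=None):
    """Sum the polarity (+1 / -1) of every word whose lemma is in the list."""
    lemma_of = lemma_of or {}
    score = 0
    for token in re.findall(r"\w+", text.lower()):
        lemma = lemma_of.get(token, token)  # stand-in for a real lemmatiser
        score += polarity.get(lemma, 0)
    return score


def sentence_sentiments(text, polarity, lemma_of=None):
    """Score each sentence separately (naive split on ., ! and ?)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [(s, sentiment(s, polarity, lemma_of)) for s in sentences]


# Invented mini-lexicon and lemma table (diacritics omitted as in the text).
polarity = {"dobry": 1, "utok": -1, "strach": -1}
lemma_of = {"utoku": "utok", "dobra": "dobry"}
article = "Zprava o utoku vyvolala strach. Dobra zprava prisla vecer."
print(sentiment(article, polarity, lemma_of))
print(sentence_sentiments(article, polarity, lemma_of))
```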


Fig. 1. Most frequent entities, March 2016, iDnes.cz


3 Evaluation

We evaluated the individual parts of the process on a total of sixteen Czech news servers and 44 manually annotated articles. The title, date and keywords are extracted with full success. The content itself is extracted with an average precision of 0.993 and a recall of 1.0. Entity extraction was verified on 20 selected manually annotated articles. Of the total of 772 occurrences, 23 (approximately 3%) were assigned or recognised incorrectly. We verified the sentiment analysis on 32 manually annotated articles, with a disagreement in 7 cases (approximately 21.9%). The results are strongly affected by the small sample size, by the quality of the NameTag, MorphoDiTa and SubLex tools themselves and, for sentiment analysis, by the annotator's point of view.

4 Experiments

We ran the experiments on four Czech news servers for the period January to April 2016 (over 17 thousand articles). Fig. 1 and Fig. 2 show an example of the output. For each day, the entity chart (Fig. 1) shows the three most frequent entities across all articles published on that day in March 2016 on iDnes.cz. The chart shows that the month contains both entities that recur fairly often (Turkey, Milos Zeman, European Union, Andrej Babis...) and entities that otherwise rarely occur in the month. The latter can be used to detect interesting events. For example, the chart shows that on 5 and 6 March 2016 the entities Slovakia and Robert Fico occur often, because elections were held in Slovakia on 5 March. On 22 and 23 March 2016 the entity Brussels occurs very often, which is related to the terrorist attacks in Brussels on the first of those days. Likewise, on 29 and 30 March there is a noticeably higher occurrence of the entity China, which corresponds to the Chinese president's visit to the Czech Republic in that period. The sentiment chart (Fig. 2) shows the average sentiment score per article over all articles published on a given day. The most prominent feature is the strong negative wave associated with the terrorist attack in Brussels, starting on 22 March 2016.
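
The per-day ranking behind the entity chart is a counting exercise. A minimal sketch, assuming each article is available as a pair of its publication day and the list of entities found in it (the function name `top_entities_per_day` and the sample data are invented):

```python
from collections import Counter, defaultdict


def top_entities_per_day(articles, k=3):
    """Map each day to its k most frequent entities over all articles of that day."""
    per_day = defaultdict(Counter)
    for day, entities in articles:
        per_day[day].update(entities)
    return {
        day: [name for name, _ in counts.most_common(k)]
        for day, counts in sorted(per_day.items())
    }


# Invented sample, not the crawled data.
articles = [
    ("2016-03-05", ["Slovensko", "Robert Fico", "Slovensko"]),
    ("2016-03-05", ["Slovensko", "Turecko", "Robert Fico", "Cina"]),
    ("2016-03-22", ["Brusel", "Brusel", "Turecko"]),
]
print(top_entities_per_day(articles))
```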


Fig. 2. Article sentiment (average score per article) by day, March 2016, iDnes.cz. Series: negative, positive.

5 Conclusion

In this project we created tools for obtaining articles from various Czech news servers, extracting the important content and performing basic text analysis on it. The outputs are stored in a structured form that supports various kinds of views and outputs. The main contribution is the adapted algorithm for extracting relevant information and the application of existing methods to analyse the data obtained. The work was verified experimentally on four leading news servers, and the individual tools were evaluated on data provided by annotators.

Annotation:

Analysis of Czech news articles 

Nowadays, as the amount of information on the internet continues to grow, automatic processing and analysis of data has become a very popular field of information technology. Online news is one of the domains in which a significant amount of diverse as well as similar information exists. The goal of this work is to create a tool for the analysis of Czech news articles. The first part is a crawler that downloads articles for analysis from news servers. In the second part, the relevant content of the articles and their other attributes are extracted from the downloaded HTML pages. The third part is a text analysis, for which modules for named entity extraction and sentiment analysis of Czech texts have been created. We performed experiments on four of the most visited Czech news portals. The results show that the presented approach is suitable for the analysis of news articles.

Abstract. The innovative field of computational creativity focuses on developing
algorithms capable of creating outputs that would be considered creative.
Language provides many opportunities for creativity, so this article describes
and compares two different approaches to generating haiku poetry. The first approach
uses an evolutionary algorithm with a human as the fitness function. The second
approach composes poems based on haiku models extracted from a haiku database.
The goal is to create poems that humans consider understandable and aesthetically valuable.

Contribution type: Work-in-progress paper 

Keywords: computational creativity, haiku poetry, natural language generation, 
poetry generation 


1 Introduction 

Computational creativity is dedicated to studying and creating systems that can be considered creative. Its goal is to study human creativity and to build systems capable of creating outputs that would be considered creative if they had been produced by a human. This article proposes and compares two approaches to computational creativity. Both approaches aim to create haiku poems.

Haiku is a genre of poetry that originated in Japan but has also been adapted to other languages. This article describes the generation of haiku in English. A traditional haiku consists of three lines with a fixed syllable count for each verse (the 5-7-5 syllable pattern). As for content, the main theme of a haiku is nature, and it aims to capture a feeling.

The first approach leverages interactive evolutionary computation, an artificial intelligence method that combines evolutionary algorithms with a human as the fitness function to achieve personalised outputs that are created and modified during evolution.

The second approach aims to implement a certain level of knowledge about haiku poems, mostly their structure and content. It relies on creating models of poems from a large corpus of haiku poems created by human authors.

Section 2 analyses and describes several related works. The following sections describe the proposed approaches to haiku generation and list several example outputs. The last part of the article compares the two approaches and describes further work.

2 Related works 

This section describes some existing applications of computational creativity that focus on generating poetry.

2.1 Poetry Generation 

The POEVOLVE system [3] generates poems using evolutionary computation with an evaluation function implemented as a neural network. The network was trained on data obtained from human evaluations of poems and is used to address the human fatigue caused by evaluating many individuals. This approach also speeds up the poem creation process. This system inspired the use of interactive evolution for haiku generation in the first approach described in this article.

The WASP system [4] generates different genres of poetry. Its input consists of a set of words and a set of reference verse patterns; it creates poems from the input words such that the output satisfies the constraints of the reference patterns.

Another interesting poetry generator is the Tra-La-La Lyrics system [5]. Its input is a melody, for which the system creates a poem as lyrics. Poem generation is based on the observation that strong beats in a song are associated with lexical stress in the words. It is a very interesting application for creating song lyrics.

2.2 Haiku Generation 

Many haiku generators can be found on the internet. The poetry engine in footnote 1 is an example of such a freely available generator. Its outputs are just random selections of words; such poems rarely have meaning and are made to entertain users.

The generator in footnote 2 is another example of a poetry engine available on the web. To generate haiku poems it uses predefined sentence models. Words are chosen randomly from a thematic dictionary based on their part of speech. Its use of models and of a database of words commonly used in haiku poems inspired the second approach described in this article.


1 Random Haiku Generator, [online, cited 10.09.2016]. Available at <http://www.randomhaiku.com/>.

2 Peter's Haiku Generator, [online, cited 10.05.2016]. Available at <http://peterhoward.org/haikugen/framset1.htm>.

3 Haiku Poetry Generation with Interactive Evolution 

An evolutionary algorithm is used to create new poems, with human evaluation as the fitness function. The proposed approach [1] was implemented as a web application.

Generation of haiku poems with interactive evolutionary computation was chosen because the user's subjective preference is the most important criterion when generating natural language, especially poetry.

Ten users took part in an experiment carried out to evaluate the performance of the application. 50% of the participants were satisfied with the poems in the final generation.

3.1 Haiku Evolution 

The source for the initial population is a haiku corpus, a database of haiku poems created by human authors. AhaPoetry (footnote 3) and DailyHaiku (footnote 4) were used as sources for the corpus.

A verse is considered the basic element of a poem, so each individual is structured as a collection of three verses. Each population contains 10 individuals. The population size was chosen experimentally: several experiments with real users were carried out and their opinions were taken into account when choosing the population size.

The number of evaluations that interactive evolutionary computation (IEC) can receive from one human user is limited by user fatigue. This is a major disadvantage of interactive evolution, because a smaller search space is explored.

At first, every poem in the population has the same fitness value, which is then modified based on human evaluation. Haiku poems with positive or neutral feedback from the human evaluator are then chosen to reproduce. The genetic operators used to create new individuals in the reproduction stage of the evolutionary cycle are crossovers with two and three parents.
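
One reproduction step of this kind can be sketched as follows. The details are assumptions made for illustration: feedback is encoded as +1, 0 or -1, a child takes verse i from a randomly chosen parent's verse i, and the number of parents (two or three) is picked at random. The function names `crossover` and `next_generation` are hypothetical and do not come from the described application.

```python
import random


def crossover(parents, rng):
    """Build a three-verse child; verse i comes from a random parent's verse i."""
    return tuple(rng.choice(parents)[i] for i in range(3))


def next_generation(population, feedback, size=10, rng=None):
    """Create `size` children from poems with positive or neutral feedback.

    population: list of poems, each a tuple of three verses.
    feedback:   list of +1 (liked), 0 (neutral) or -1 (disliked), one per poem.
    """
    rng = rng or random.Random()
    pool = [poem for poem, score in zip(population, feedback) if score >= 0]
    if len(pool) < 2:
        raise ValueError("need at least two poems with non-negative feedback")
    children = []
    while len(children) < size:
        parent_count = rng.choice([2, 3]) if len(pool) >= 3 else 2
        children.append(crossover(rng.sample(pool, parent_count), rng))
    return children


# Placeholder verses; a real run would start from corpus poems.
population = [("a1", "a2", "a3"), ("b1", "b2", "b3"),
              ("c1", "c2", "c3"), ("d1", "d2", "d3")]
for child in next_generation(population, [1, 0, -1, 1], size=3, rng=random.Random(0)):
    print(child)
```

Keeping each verse in its original position preserves the line structure of the parents, which is one plausible reading of "cross-over with two and three parents".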

3.2 Example Outputs 

Several poems generated by the system:

picking wildflowers
the early spring sun
in my hand

cherry blossoms
the ant carries only
its papery leaves

a winter bracken
in full bloom he photographs
his fat wife


3 Aha Poetry, [online, cited 10.05.2016]. Available at <http://www.ahapoetry.com/aadoh/h_dictionary.htm>.

4 DailyHaiku, [online, cited 10.05.2016]. Available at <http://www.dailyhaiku.org/>.


4 Haiku Poetry Generation with Poem Models 

Haiku generation with poem models constructs poems from words based on their part of speech and number of syllables. It was designed to avoid the problems of the interactive evolution approach. The approach was implemented as a web application.

No experiments have yet been conducted to evaluate the performance of the application. Its outputs were compared with those generated by interactive evolution; the comparison can be found in section 5.

This approach [2] ensures that the final poem conforms to the required syllable pattern by using a syllable counting algorithm during poem creation. By using a dictionary extracted from haiku poems, it aims to create poems with haiku-specific words and thereby also to take content criteria into account.

4.1 Haiku Generation 

A word is considered the basic element of a poem, and a poem is constructed from words based on word metadata. The pattern for word selection (filling the model with words) is defined by the poem model.

In the data preparation phase a dictionary is created from the haiku corpus (the same corpus as used in the interactive evolution of poems).

The dictionary consists of all words from the poems in the haiku corpus. Each word in the dictionary
is defined by three properties:

• word itself 

• part of speech (metadata) 

• syllable count (metadata) 

The haiku corpus is also used for poem model extraction. A poem model consists of a list of parts of speech and a list of syllable counts. A poem model is constructed from every poem in the haiku corpus. Unique haiku models with a frequency (number of occurrences) of 5 are kept as haiku-specific. The frequency constant was chosen experimentally.

Words from the dictionary are selected into the poem based on part of speech and number of syllables. The poem is displayed for evaluation by a human user to determine the performance of the system.
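
The data preparation and generation steps can be sketched on a toy corpus. Several things here are assumptions: the corpus is taken to be already tagged with part of speech and syllable count, a model is stored as (part of speech, syllables) pairs per line rather than as two separate lists, the frequency constant is read as a minimum, and the names `extract_models`, `build_dictionary` and `generate` are hypothetical.

```python
import random
from collections import Counter


def extract_models(tagged_poems, min_freq=5):
    """Keep poem models that occur at least min_freq times in the corpus.

    A poem is a list of lines; a line is a list of (word, pos, syllables).
    """
    counts = Counter(
        tuple(tuple((pos, syl) for _, pos, syl in line) for line in poem)
        for poem in tagged_poems
    )
    return [model for model, n in counts.items() if n >= min_freq]


def build_dictionary(tagged_poems):
    """Index corpus words by their (part of speech, syllable count) metadata."""
    index = {}
    for poem in tagged_poems:
        for line in poem:
            for word, pos, syl in line:
                index.setdefault((pos, syl), set()).add(word)
    return {key: sorted(words) for key, words in index.items()}


def generate(model, dictionary, rng):
    """Fill every slot of the model with a random word of matching metadata."""
    return [" ".join(rng.choice(dictionary[slot]) for slot in line) for line in model]


# Toy corpus of two invented poems with the same structure.
poem_a = [[("old", "ADJ", 1), ("pond", "NOUN", 1)],
          [("frog", "NOUN", 1), ("jumps", "VERB", 1)],
          [("water", "NOUN", 2)]]
poem_b = [[("cold", "ADJ", 1), ("rain", "NOUN", 1)],
          [("crow", "NOUN", 1), ("sleeps", "VERB", 1)],
          [("silence", "NOUN", 2)]]
corpus = [poem_a, poem_b]
models = extract_models(corpus, min_freq=2)
print(generate(models[0], build_dictionary(corpus), random.Random(1)))
```

Because every slot fixes the syllable count, any poem generated from a 5-7-5 model keeps that pattern by construction, which is the formal guarantee the section describes.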

4.2 Example Outputs

Several poems generated by the system:
always helpless 
nearby responsibly shrivels 
a greedy banyan 

in flawless garden 
mysterious dahlia 
quietly complains 

something which blossoms 
conspire according cow 
dry shadow outside

5 Comparison of Proposed Approaches 

A noticeable problem of the poems generated with interactive evolution is that not all of them conform to the 5-7-5 syllable pattern. The reason is that not every poem in the haiku corpus follows the pattern, and this carries over into the generated poems. To avoid this failure to meet the formal haiku criteria, the approach based on poem models was proposed. To make sure the syllable count rule is followed, it takes the number of syllables into account while creating a poem.

Both approaches create haiku poems that contain vocabulary related to nature or used to express emotion. The haiku content comes from using a large corpus of haiku poems written by human authors.

Sometimes both systems create less meaningful poems. With interactive evolution this happens when two poems with different topics and/or different emotions are selected for crossover. With poem models, the system has no further knowledge of how to select words from the dictionary so that the chosen words are related by sentiment and topic.
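
Checking a poem against the 5-7-5 pattern needs a syllable counter. The article does not say which counting algorithm is used, so the sketch below substitutes a common rough heuristic for English (count groups of consecutive vowels, discount a silent final "e"); it miscounts some words and is meant only to illustrate the check. The names `count_syllables`, `line_syllables` and `conforms` are hypothetical.

```python
import re


def count_syllables(word):
    """Rough English syllable estimate based on vowel groups (heuristic only)."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return 0
    groups = len(re.findall(r"[aeiouy]+", word))
    # A final "e" is usually silent, except in endings such as "-le".
    if word.endswith("e") and not word.endswith("le") and groups > 1:
        groups -= 1
    return max(groups, 1)


def line_syllables(line):
    return sum(count_syllables(word) for word in line.split())


def conforms(lines, pattern=(5, 7, 5)):
    """True when the syllable counts of the lines match the pattern."""
    return [line_syllables(line) for line in lines] == list(pattern)


evolved = ["picking wildflowers", "the early spring sun", "in my hand"]
print([line_syllables(line) for line in evolved], conforms(evolved))
```

Applied to the first example poem from the interactive evolution section, the heuristic yields counts that differ from 5-7-5, in line with the observation above that not all evolved poems follow the pattern.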

6 Conclusion 

When the generated poems of both systems are compared, generation with poem models creates better poems both formally and in terms of content. The reason is that it is easier to combine words than whole verses.

The application that creates poems using poem models will later be made accessible to the general public in order to evaluate and test the poetry generator with real users and to improve it continuously.

Abstract. Emotional words play an important role in opinion classification and sentiment analysis. Finding and identifying them in text is therefore very important. The search phase is followed by a processing phase. To simplify processing and improve results, it is advisable to normalise these words in some way, whether by stemming or by lemmatisation. This work is devoted to creating a Slovak stemmer for emotional words, which are subsequently used for emotion classification. The stemmer is based on the grammatical rules of the Slovak language and, using defined suffixes, prefixes and rules, can convert a word in any morphological form to its stem without knowing its part of speech. The parts of speech that the stemmer can process are nouns, adjectives and verbs carrying any emotion. For testing we used a dictionary containing 17,872 inflected words. The stemmer achieved an overall accuracy of 98.1%.


Contribution type: Research paper

Keywords: stemmer, prefix, suffix, Slovak language, word stem


1 Introduction

Every one of us comes into contact with the internet every day, using it for communication, entertainment, work or searching for information. Today the internet holds a large amount of data that is difficult to search manually. For this reason, mechanisms for information retrieval and search are coming to the fore. Stemming algorithms are used in precisely this area and find their widest application in web search engines such as Google. Besides web tools, these algorithms can also be used, for example, in natural language processing. And it is in computer processing that Slovak is very difficult. Slovak is a language with rich morphology, which means that a word can take various forms in text; this is the main reason why Slovak lags considerably behind other languages. The large number of exceptions and word forms that arise in inflection significantly complicates work with Slovak text. It is therefore very difficult to create an algorithm that converts individual words to their base form.

2 Stemming in the Slovak language

Several kinds of stemming and lemmatization algorithms exist; they work on different
principles and achieve different results. Each has its advantages and disadvantages.
Some achieve good, high-quality results but are demanding in time and computing power.
Others achieve lower-quality results, but their advantage is processing speed [1].

A Slovak stemmer for nouns was implemented in the Lucene search system. The stemmer was
inspired by the rules of a Russian stemmer. It is a program that converted inflected
nouns to their base form by removing suffixes. The application achieved a success rate
of approximately 90%. The stemmer was tested on two articles, and its output consisted
of pairs of a word and its root [2].

A Slovak stemmer for Slovak surnames and street names was likewise implemented in
Lucene. This stemmer can remove the suffixes that occur in Slovak surnames and street
names. The application was tested on 70 words comprising street names and surnames in
various forms. The stemmer achieved a success rate of almost 99%, misjudging only one
name of foreign origin [3].

Tvaroslovník is a program developed at UPJŠ in Košice. It is a database containing
30 000 000 forms of Slovak words. The words are stored in the database in the form of
text files. Each word carries a record of its part of speech and its grammatical
categories. Once these files had been loaded into the database, the program could be
used for lemmatization or for obtaining all forms of a given word. For stemming, the
program uses a template: it looks for the base form of a word by comparing the word
with the template. Like every lemmatizer for Slovak, this one is not free of errors.
The text files loaded into the database were created by several students, not all of
whom approached their work responsibly, so the data contain a number of errors.
Lemmatization runs at an average speed of 132 words per second. This database of all
word forms is also frequently used in many other projects [4] [5].

3 Stemmer for emotion classification

The Slovak stemmer for emotion classification stems emotional words using an algorithm
that strips suffixes and prefixes. Because Slovak is a language with rich morphology,
exceptions arise frequently. Since most emotional words form their degrees of comparison
regularly, the problem of exceptions was negligible for us. The advantages of this
approach are high precision, provided the right rules are applied, and speed. The
stemmer will subsequently be integrated into a sentiment analysis algorithm, where it
should considerably shorten classification time compared with dictionary-based stemming.

When building the suffix list, we first collected all suffixes that arise in the
declension of adjectives and nouns and in the conjugation of verbs. We took these
suffixes from Pravidlá slovenského pravopisu (the Rules of Slovak Orthography). We
merged the suffixes and removed duplicates. The result is an array of endings
containing 130 suffixes (Tab. 1). Among prefixes, we selected mainly those of foreign
origin. Some of them are used in forming compound words, for example dobro-srdecny.
Tab. 1 also shows the 13 prefixes that are removed during stemming if they occur in a
word.


Tab. 1. All prefixes and suffixes used.


PREFIXES

dis-, kilo-, homo-, mega-, malo-, polo-, seba-, poly-, hypo-, ultra-, infra-,
dobro-, hyper-

SUFFIXES

-encoch, -encami, -ujete, -ujeme, -ovalo, -ovali, -ovala, -eniec, -encom,
-atami, -atach, -ujuc, -ujte, -ujme, -ujes, -ujem, -ovia, -ovat, -oval, -iete,
-iemu, -ieme, -ieho, -iami, -ialo, -iali, -iala, -iach, -ence, -ejuc, -ejte, -ejme,
-atom, -atam, -ajuc, -ajte, -ajme, -ymi, -ych, -ulo, -uli, -ula, -uju, -uje, -ovi,
-och, -ite, -iou, -iom, -imi, -ime, -ilo, -ili, -ila, -ich, -iev, -iet, -ies, -ien,
-iem, -iel, -iej, -iat, -iam, -iac, -ete, -emu, -eme, -elo, -eli, -ela, -eju, -eho,
-aty, -atu, -ati, -ate, -ata, -ami, -ame, -alo, -ali, -ala, -aju, -ach, -ym, -ut,
-us, -ul, -uj, -uc, -te, -ov, -ou, -om, -ol, -ok, -mu, -mi, -me, -lo, -li, -la, -iu,
-it, -is, -io, -im, -il, -ii, -ie, -ia, -ho, -es, -en, -em, -el, -ej, -at, -as, -am,
-al, -aj, -ac, -y, -u, -o, -i, -e, -a.


The algorithm for stemming emotional words consists of the following parts:

Input and preprocessing of a text file or a single word:

• Input and loading of a word or a text file.

• Conversion of upper-case letters to lower case and removal of diacritics.

• The algorithm checks the word length. If the word has 3 letters and ends in one of
the consonants "-j, -l, -m, -s, -t, -v, -z", the user is informed and the algorithm
outputs the word unchanged; otherwise it continues.

Prefix removal:

• If the word begins with the prefix "dis-", the prefix is removed.

• If the word is longer than 4 letters and begins with one of the prefixes "homo-,
kilo-, malo-, mega-, polo-, seba-, poly-, hypo-", the prefix is removed.

• If the word is longer than 5 letters and begins with one of the prefixes "dobro-,
infra-, ultra-, hyper-", the prefix is removed. The length is checked because otherwise
a prefix could be removed even when it is itself the whole word. For example, dobro-
can also be the noun dobro in the nominative; without the length check, this whole
word would be removed.

Suffix removal:

• Check the word length and set the length of the suffix to search for.

• If the word is longer than 7 letters, the algorithm sets the start of the searched
suffix to the difference between the word length and the length of the longest suffix,
that is, k = word length - 6. For example, for a word of 8 letters, k = 8 - 6 = 2, and
the word is searched from the 3rd letter onward.

• If the word is shorter than 7 letters, the variable k is set to 2, because words
shorter than 2 letters are not stemmed at all.

• Start of the for loop that scans the word and compares suffixes.

• The algorithm scans the word starting from the longest possible suffix and compares
it with the predefined suffixes (the array of endings).

• If a matching suffix is found, the algorithm removes it; otherwise it continues
scanning and comparing. If no matching suffix is found, the word has no suffix and is
already the stem being sought.

Removal of degrees of comparison:

• This step handles the superlative (3rd degree) of adjectives: if, after the preceding
suffix removal, the word ends in "-s" and begins with the prefix "naj-", the algorithm
removes this prefix and informs the user.

• If the word ends in the suffix "-ejs", this suffix, which is used in the comparative
and superlative (2nd and 3rd degree) of adjectives, is removed as well.
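
The steps above can be sketched in Python as follows. This is a minimal illustration, not
the authors' implementation: the names `normalize`, `stem`, `PREFIXES_4`, `PREFIXES_5` and
`SUFFIXES` are assumptions, the three prefix rules are assumed to be applied one after
another, only a subset of the suffixes from Tab. 1 is included, and the messages shown to
the user are omitted.

```python
import unicodedata

# Prefixes grouped by length, as listed in the prefix-removal step.
PREFIXES_4 = ("homo", "kilo", "malo", "mega", "polo", "seba", "poly", "hypo")
PREFIXES_5 = ("dobro", "infra", "ultra", "hyper")

# Illustrative subset of the suffix array (the full array has 130 entries).
SUFFIXES = {
    "encoch", "encami", "ovali", "ovala", "ovalo", "ujeme", "ujete",
    "iemu", "ieho", "ymi", "ych", "emu", "eho", "ami", "ach", "och",
    "ou", "om", "ov", "ym", "ej", "y", "u", "o", "i", "e", "a",
}
LONGEST_SUFFIX = 6


def normalize(word: str) -> str:
    """Lower-case the word and strip diacritics."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def stem(word: str) -> str:
    word = normalize(word)

    # Three-letter words ending in these consonants are returned unchanged.
    if len(word) == 3 and word[-1] in "jlmstvz":
        return word

    # Prefix removal; the length checks stop a prefix from consuming
    # a whole word such as "dobro".
    if word.startswith("dis"):
        word = word[3:]
    if len(word) > 4 and word.startswith(PREFIXES_4):
        word = word[4:]
    if len(word) > 5 and word.startswith(PREFIXES_5):
        word = word[5:]

    # Suffix removal: try candidate suffixes from the longest to the shortest.
    k = len(word) - LONGEST_SUFFIX if len(word) > 7 else 2
    for start in range(k, len(word)):
        if word[start:] in SUFFIXES:
            word = word[:start]
            break

    # Degrees of comparison: drop "naj-" of the superlative, then "-ejs".
    if word.endswith("s") and word.startswith("naj"):
        word = word[3:]
    if word.endswith("ejs"):
        word = word[:-3]
    return word


print(stem("smutný"), stem("najsmutnejší"), stem("dobrosrdečný"), stem("dobro"))
```

With this subset, the positive and the superlative form of the same adjective ("smutný",
"najsmutnejší") are reduced to the same stem, and "dobro" keeps its stem because the
length check prevents the prefix rule from removing the whole word.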

4 Testing and evaluation of the application

The Slovak stemmer for emotion classification was tested on a dictionary of sentiment
words. The dictionary contains 17 872 words: 2 473 nouns, 11 293 adjectives and 4 106
verbs. The stemmed words were manually compared with stems determined by an expert. To
evaluate precision we used the formula $P = \frac{TP}{TP + FP}$, where TP is the number
of words stemmed correctly by the application and FP is the number of words stemmed
incorrectly. The precision for each part of speech, as well as the overall precision,
is given in Tab. 2.


Tab. 2. Stemming precision for the individual parts of speech.


Part of speech   Correctly stemmed   Incorrectly stemmed   All words   Precision (%)
nouns                         2390                    83        2473           96.64
adjectives                   11137                   156       11293           98.61
verbs                         4011                    95        4106           97.68


In our opinion, this high precision is due to removing diacritics at the very start of
stemming. Because declension changes the diacritical marks in the last syllable of a
word, the stemmer would achieve considerably lower precision if it also had to handle
diacritics.

Another reason for the high precision may be the stemmer's focus on emotional words.
Applying the algorithm to neutral words could lower the precision: nouns, in which
declension exceptions occur, form the smallest part of the test set, and the set
contains only a negligible number of words that are exceptions
(zlo -> ziel, neha -> nieh).
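
The two summary figures quoted in this paper, roughly 98.1% overall and 97.6% on average,
can be related to Tab. 2 with a short calculation. The sketch below assumes that the
overall figure pools all 17 872 words, while the average is the unweighted mean of the
three per-class precisions; the variable names are illustrative.

```python
# (correctly stemmed, incorrectly stemmed) per part of speech, taken from Tab. 2
counts = {
    "nouns": (2390, 83),
    "adjectives": (11137, 156),
    "verbs": (4011, 95),
}


def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP)."""
    return tp / (tp + fp)


per_class = {pos: 100 * precision(tp, fp) for pos, (tp, fp) in counts.items()}

total_tp = sum(tp for tp, _ in counts.values())
total_fp = sum(fp for _, fp in counts.values())
overall = 100 * precision(total_tp, total_fp)      # pooled over all words
macro = sum(per_class.values()) / len(per_class)   # mean of the three rows

for pos, value in per_class.items():
    print(f"{pos}: {value:.2f}%")
print(f"overall: {overall:.2f}%  mean of classes: {macro:.2f}%")
```

Under these assumptions both figures agree with the reported values up to rounding: the
pooled precision is dominated by the large adjective class, which is why it is higher than
the unweighted mean.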

5 Conclusion

The stemmer created in this work is aimed at stemming emotional words. It works by
removing suffixes and prefixes according to the rules of the Slovak language. It differs
from other stemmers mainly in its use of prefixes: it includes prefixes that occur
mainly in foreign words expressing emotion, and in other emotional words such as
compounds, for example dobrosrdecny. The application achieved very good results when
tested on 17 872 emotional words. Testing showed an average precision of 97.64%. This
high precision may be due to the removal of diacritics, the focus on emotional words,
and a test set containing few words that represent exceptions in Slovak.

Acknowledgement. This paper was supported by the Scientific Grant Agency of the Ministry
of Education, Science, Research and Sport of the Slovak Republic within project No.
1/0493/16 "Metody a modely pre analyzu prudov dat" (Methods and models for data stream
analysis).

Annotation:

Slovak stemmer for emotional words

Emotional words play an important role in opinion classification and sentiment analysis,
so it is very important to find and identify them. Once identified, they must also be
processed correctly, which can be done by stemming or lemmatization. This paper focuses
on creating a Slovak stemmer for emotional words that can be used for opinion
classification. Our stemmer is based on grammatical rules; it removes prefixes and
suffixes and finds the stem of a word. It processes mainly adjectives, nouns and verbs
that carry emotions. The stemmer was tested on 17 872 words and achieved an accuracy of
98.1%.

Abstract. Aviation safety reports are starting to play a crucial role in understanding
incidents and accidents in the aviation safety field. Automated text processing is
necessary to simplify the safety reporting process. This task can be achieved in
different ways, using statistical techniques, non-statistical techniques or a
combination of both. In this paper, we focus mainly on non-statistical ones and
introduce our text processing scenario. We start by indexing the various aviation
safety vocabularies that we use as a backbone for this task. Next, a gold standard
corpus is prepared, and several Linked Data knowledge extraction tools are tested with
respect to a domain-specific vocabulary. We then choose the most accurate entity
annotation tools and make them work together, along with other features that we added
to handle some very specific terms and abbreviations used in the aviation field. The
ultimate goal is to build a tool that integrates several techniques in order to provide
high-precision annotations of reports in the aviation safety domain.

Contribution type: Research paper 

Keywords: text analyzing, ontology, aviation safety 


1 Introduction 

Initial incident and accident reports are the best sources of information for extracting
the most important knowledge to feed the process of building preliminary 1 reports.
Initial reports are usually free-form text describing the incident or accident, along
with a small set of metadata (mostly concerned with the time, the location and the
equipment involved) [1]. Automatic analysis of such reports is challenging because they
are usually short and contain many aviation-specific terms and abbreviations. Recently,
some entity recognition tools have appeared. As mentioned in [2], if a custom vocabulary
can be loaded (configured or programmed) into the tool, it significantly improves the
recognition of entities. For that reason, we selected only tools that combine Natural
Language Processing (NLP) and Linked Data capabilities and allow custom indexes to be
built from domain-specific vocabularies.

This paper describes our experience gathered while analyzing aviation safety reports in
the Czech environment, as well as the evaluation of our pipeline on selected reports.


1 A preliminary report is created by the safety department of an organization and sent
to the authority.

2 Description of the corpus 

In order to evaluate the pipeline, we had to create a gold standard corpus. It mainly
consists of initial safety reports in the aviation safety domain. Experts in the
aviation domain manually annotated domain terms (entities) in each report with respect
to large controlled domain-specific vocabularies. Technically, they used the General
Architecture for Text Engineering (GATE) tool 2 . We need this kind of corpus for
evaluating the tool, as well as for augmenting our aviation ontology 3 with more terms
and relations in future work.

3 Aviation Safety Text Analyzing Tool 

Text processing helps in building the preliminary safety report from the initial ones.
The initial reports usually contain very basic information (written in natural
language) about the specific accident or incident, such as the participants in the
safety occurrence, the place and the time.

Many techniques have been introduced for better text understanding and entity
recognition. Some are ontology-based: ontologies and other knowledge resources are
widely used to aid recognition in special-domain texts, and linking the text back to
the ontologies yields better understanding and additional knowledge. Statistical entity
recognition models, with their various algorithms, can overcome some of the
shortcomings of the other techniques.

To detect entities in such reports, we tested several entity recognition tools. We
described a portfolio of objects, events and artifacts that are important for safety
reporting. We show a roadmap of how the most important information in the text is
detected and used in the aviation safety reporting tool. The tools are described in the
following subsections.

3.1 Apache Stanbol 

Apache Stanbol can work with custom vocabularies and build custom indexes on them,
which is necessary for detecting various types of entities and for working with
concepts from a specific domain [3]. It also comes with a list of enhancement engine
implementations and allows a specific one to be built to get the most benefit out of
the tool [4]. This allowed us to build a chain of enhancement engines that is well
suited to detecting aviation safety concepts.


2 https://gate.ac.uk/

3 https://www.inbas.cz/aviation-safety-ontology

3.2 DBpedia Spotlight 

DBpedia Spotlight offers three basic functions: Annotate, Disambiguate and best
K-Candidates. It can be accessed through a REST web service and through a web user
interface [5]. It also allows a Spotlight model to be created on the user's own server
through an internationalization process, in order to model occurrences of resources
together with the context in which they are mentioned.

Building a customized index according to the aviation ontology is an intensive task
with DBpedia Spotlight. Extra effort is needed to extract surface forms and valid URIs
from the gold standard corpus and then build the dictionary-based spotter from
them [6].


3.3 Customized techniques 

Some of the artifacts that we defined are hard to detect with the previously mentioned
tools, despite their indexing capabilities and their ability to detect mentions from
the specific terminology. For these terms, we use different detection techniques that
exploit prior knowledge of the rules the terms follow. For detecting aircraft call
signs, for instance, regular expressions are used, taking into account the rules for
the different possible call sign formats [7].
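
As an illustration of the regular-expression approach, the sketch below detects call signs
made of a three-letter operator designator followed by a flight number. The pattern
`CALL_SIGN` and the helper `find_call_signs` are hypothetical and cover only this one
format; the rules actually used in the tool are not reproduced here.

```python
import re

# Assumed format: three capital letters, 1-4 digits, optional trailing letter.
CALL_SIGN = re.compile(r"\b[A-Z]{3}\d{1,4}[A-Z]?\b")


def find_call_signs(text: str) -> list[str]:
    """Return all substrings of `text` that match the assumed call sign format."""
    return [match.group(0) for match in CALL_SIGN.finditer(text)]


report = (
    "In-flight shutdown, flight no. TVP7266, KTW-BOJ. Emergency landing in SOF. "
    "ENG 2 LOW OIL PRESS, low quantity, FADEC automatically stop the engine."
)
print(find_call_signs(report))
```

Upper-case abbreviations without digits, such as FADEC or KTW-BOJ, do not match because
the pattern requires digits directly after the three letters.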

The outputs of Apache Stanbol, DBpedia Spotlight and the customized techniques are
parsed, merged and optimized in a RESTful web service. The service outputs the detected
entities together with their mapping to the aviation ontology.


[Figure: a plaintext report ("In-flight shutdown, flight no. TVP7266, KTW-BOJ. Emergency
landing in SOF. ENG 2 LOW OIL PRESS, low quantity, FADEC automatically stop the
engine.") is processed by DBpedia Spotlight, Apache Stanbol and the customized
techniques; a RESTful web service combines their outputs into the annotated report.]


Fig. 1. The processing pipeline

4 The evaluation process

A testing framework was developed for the evaluation. It parses the output of the tool
and compares it with the expected annotations in the corresponding document of the gold
standard.

We observed the True Positives (TP), cases that are positive and predicted positive;
the False Positives (FP), cases that are negative but predicted positive; and the False
Negatives (FN), cases that are positive but predicted negative. These statistics are
then used to calculate the usual precision, recall and F1 measures.
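
The comparison against the gold standard can be sketched as follows. This is an
illustration only, not the testing framework itself: the function `evaluate` and the
representation of an annotation as a (start, end, type) tuple are assumptions, and the
example annotations are made up.

```python
def evaluate(gold: set, predicted: set) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for predicted annotations against gold ones."""
    tp = len(gold & predicted)   # annotated by the experts and found by the tool
    fp = len(predicted - gold)   # found by the tool but absent from the gold standard
    fn = len(gold - predicted)   # annotated by the experts but missed by the tool
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical annotations: (start offset, end offset, entity type).
gold = {(0, 6, "Event"), (13, 44, "Trope"), (50, 65, "Location"), (67, 73, "Location")}
predicted = {(0, 6, "Event"), (50, 65, "Location"), (80, 87, "CallSign")}

p, r, f1 = evaluate(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Exact matching of spans and types is a design choice of this sketch; a framework could
also count partially overlapping spans as matches, which would change all three measures.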


Table 1. Evaluation results for samples of reports
(With / Without: with / without custom vocabulary)


             Precision           Recall              F1
             With     Without    With     Without    With     Without
Report 1     0.5      0          0.095    0          0.160    0
Report 2     0.3125   0.071      0.555    0.111      0.4      0.091
Report 3     0.625    0.286      0.435    0.174      0.513    0.216
Report 4     1        1          0.278    0.167      0.435    0.286
Report 5     0.5      0          0.1      0          0.1667   0
Report 6     1        1          0.583    0.417      0.737    0.588
Report 7     0.5      0.5        0.182    0.182      0.267    0.267
Report 8     0.444    0.083      0.381    0.048      0.410    0.061
Report 9     0.667    0          0.286    0          0.399    0
Report 10    0.5      0.4        0.375    0.25       0.428    0.308
Report 11    0.667    0          0.182    0          0.286    0
Report 12    0.8      0.5        0.470    0.118      0.592    0.19
Report 13    1        1          0.148    0.148      0.258    0.26


As the evaluation statistics for arbitrary samples of the corpus show, precision is
high in most cases and even reaches 100% for some reports. Recall, on the other hand,
is low, which lowers the F1 measure for most reports. In our case, however, the F1
measure might not be the best evaluation criterion. As mentioned previously, our
ultimate goal is high-precision annotation of aviation safety reports, so that the
annotations can be used directly in practice.

[Figure: three bar charts (Precision, Recall, F1) comparing the results with and
without the aviation vocabulary]


Fig. 2. Reports' evaluation charts


[Figures: the same report (R11), annotated manually by experts (Fig. 3) and
automatically by the tool (Fig. 4). Report text: "Flight had a prolonged loss of
communication over Swiss territory. Zurich radar informed at 09.28.39 about loss of
contact and that also no contact on 121,5 MHz could be established. Geneva informed
09.46.35 that ABC1234 has contacted them. Length of loss of comm. is approx. 11
minutes."]


Fig. 3. Sample of manually annotated report (R11) by experts

Fig. 4. Sample of automatically annotated report (R11) by the tool


Table 2. Entities detected and their types according to the Aviation ontology 4


Entity Name                        Entity Type
Flight                             Event
prolonged loss of communication    Trope
territory                          Location
Geneva                             Location
radar                              Object
ABC1234                            CallSign


4 onto.fel.cvut.cz/ontologies

5 Future work

The text analysis tool for aviation safety reports is intended to be integrated into
the reporting workflow. For DBpedia Spotlight, further work can be done on the
disambiguation feature and on taking context into consideration during annotation.
Furthermore, domain-specific techniques will be considered: more artifacts can be
declared and detected within the customized techniques. This should eventually raise
recall, precision and ultimately the F1 measure. In future research, we will focus on
detecting relations between concepts rather than only the concepts themselves, which
should lead to better understanding and analysis of the reports.

Acknowledgements: This work was partially supported by grants No. TA04030465 Research
and development of progressive methods for measuring aviation organization's safety
performance of the Technology Agency of the Czech Republic,

Abstract. More and more text data appears on the Internet every day. These texts are an
interesting source of information that can serve both other people on the Internet and
companies engaged in sales or marketing. When there is a very large amount of such
data, automated methods are the best way to analyze it. In this work we focused on a
combination of two methods, the dictionary approach and the Naive Bayes classifier
(NBK). This combination should remove the problems of both approaches: the dictionary
approach cannot evaluate posts that contain no words from the dictionary, while the NBK
needs a training set, which is hard to obtain, especially for a new domain. The
presented approach achieved an overall precision of 57.63%.

Contribution type: Research paper

Keywords: dictionary approach, machine learning, Naive Bayes classifier, opinion
classification


1 Introduction

Sentiment analysis is one of the tasks of natural language processing (NLP). It covers
extracting and analyzing an author's emotions towards products, determining the
author's opinions on the political situation, or evaluating reviews. Such evaluations
are mostly expressed on the Internet in posts on social networks or blogs. In recent
years the volume of opinions expressed on the web has grown markedly, and these
opinions are becoming the focus of many researchers.

Sentiment analysis is applied in various areas, for example in marketing, where social
networks and the Internet are used to monitor customer reactions to new products and
services. The primary task of sentiment analysis is to find an opinion, identify the
sentiment it expresses and classify its polarity. Sentiment expresses human feelings
and emotions towards some object. It is most often divided into three categories:
positive, neutral and negative [2].

2 Methods used for sentiment analysis

The methods used for sentiment analysis can be divided into two categories: methods
based on machine learning and methods based on the dictionary approach. There are also
hybrid approaches that combine the two. Machine learning methods use learning
algorithms to process the individual linguistic features of the analyzed text. The most
frequently used classifiers include decision trees, linear classifiers, probabilistic
classifiers and rule-based classifiers.

Zhang et al. [6] compared the Naive Bayes classifier and the Support Vector Machine on
restaurant reviews. They also examined the influence of the representation and the size
of the feature space. The best accuracy, 95.67%, was achieved by the NBK using 900-1100
attributes. Machine learning algorithms also include unsupervised learning methods,
which approach the problem without knowing what the result should be. In unsupervised
learning, examples are grouped into clusters according to some criterion, most often
similarity. The basic algorithms of unsupervised learning are clustering
algorithms [5].

2.1 Dictionary-based methods

Many sentiment analysis tasks make use of words that express our opinions and feelings,
known as opinion words. Positive words are used to express a desired state, whereas
negative words express undesired states. A list of opinion words is called a dictionary
or lexicon. Such a dictionary is then used to identify the orientation of posts. Lu et
al. [3] determined the polarity of posts by multiplying adjectives and adverbs; their
approach achieved an accuracy of 71.7%. Kennedy and Inkpen [1] studied extending the
approach with intensifiers and negators. They compared the results of a dictionary
using intensifiers and negators with those of a dictionary without them. The dictionary
with intensifiers and negators achieved better results (67.8% on average) than the
original dictionary (66.5% on average).

3 Naive Bayes classifier

In our application we decided to use the Naive Bayes classifier (NBK) because of its
simplicity and relatively good results. It is a simple classifier based on Bayes'
theorem with a strong assumption of attribute independence. It is very often used for
simple text classification tasks such as spam detection, e-mail filtering, document
categorization, language detection and sentiment analysis. Despite the strong
independence assumption, the Naive Bayes classifier achieves very good results in
real-world applications as well.

Two types of NBK are used in practice [4]:

• Multinomial NBK - this variant estimates the conditional probability of a term given
a class as the relative frequency of the term in the documents that belong to category
C. It takes into account the number of occurrences of the term in the training
documents of class C, including multiple occurrences.

• Bernoulli NBK - generates a binary indicator for each term of the dictionary, with
the value 1 if the term occurs in the document and 0 if it does not. This model does
not take the number of occurrences of terms into account, but it does take into account
terms that do not occur in the document.
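
The difference between the two variants can be illustrated on a toy example. The sketch
below is only an illustration with made-up documents and vocabulary; it omits smoothing,
which a practical classifier would need in order to avoid zero probabilities.

```python
from collections import Counter

# Hypothetical training documents of one class C and a small vocabulary.
docs = [["good", "good", "film"], ["good", "actor"]]
vocab = ["good", "film", "actor", "bad"]

# Multinomial: relative frequency of a term among all tokens of class C,
# so repeated occurrences within a document are counted.
token_counts = Counter(token for doc in docs for token in doc)
total_tokens = sum(token_counts.values())
multinomial = {term: token_counts[term] / total_tokens for term in vocab}

# Bernoulli: fraction of documents of class C that contain the term at all.
bernoulli = {term: sum(term in doc for doc in docs) / len(docs) for term in vocab}

# Binary indicator vector of the first document over the whole vocabulary;
# absent terms appear explicitly as 0.
indicator = [1 if term in docs[0] else 0 for term in vocab]

print(multinomial)
print(bernoulli)
print(indicator)
```

The term "good" occurs three times in five tokens but in both documents, so the two
estimates differ: the multinomial variant reflects how often a term is used, the Bernoulli
variant only whether it is used.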

4 Design and implementation of the hybrid model

Our task was to create a hybrid model for opinion classification that combines the
dictionary approach with a machine learning method. We built an application that first
creates a training set using the dictionary approach and then trains the NBK on this
training set. The trained NBK is then applied to the test set. The application is
divided into 3 parts:

The first part preprocesses the collected data. Stop words and diacritics are removed,
the text is split into words, and some words are converted to their base form. The
preprocessed data serve as input to the next part of the application.

The second part trains the NBK: a list of words is created from the individual posts,
and a priori probabilities are computed for these words.

In the last part the NBK is applied to the test posts.

The classification process has two phases. In the first, the NBK is trained; in the
second, the trained classifier is used to classify the test cases. At the start, empty
matrices are initialized, in which the number of columns corresponds to the number of
words in the dictionary and the number of rows to the number of posts in the given
class. The classifier is trained on these matrices. There are as many matrices as there
are classes. After loading, the posts are traversed and each is checked for being
positive or negative. The posts are split into words, and for every word it is checked
whether it occurs in the word list. If it does not, the value at the corresponding
position stays 0; if it does, the a priori probability of the word is taken from the
dictionary and written at the position of that word. In this way the matrices for the
positive and the negative class are built. During classification, a zero vector is
initialized for each post; its length equals the number of words in the word list. The
posts are then loaded one at a time and split into words. The unknown post is traversed
word by word, and each word is looked up in the dictionary. If the word occurs in the
dictionary, its a priori probability from the dictionary is written into the vector.
The final probabilities are then computed separately for the positive and the negative
class. Finally, the post is assigned to the class with the higher probability.
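
The overall idea of the hybrid model can be sketched as follows. This is not the
matrix-based implementation described above: the sketch substitutes a standard multinomial
Naive Bayes classifier with add-one smoothing, and the lexicon, the posts and all function
names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical opinion lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"good": 1, "great": 1, "bad": -1, "awful": -1}


def lexicon_label(tokens):
    """Return 'pos' or 'neg', or None when the lexicon cannot decide."""
    score = sum(LEXICON.get(token, 0) for token in tokens)
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return None


def train(labelled):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    for tokens, label in labelled:
        class_docs[label] += 1
        word_counts[label].update(tokens)
    vocab = {token for counts in word_counts.values() for token in counts}
    return class_docs, word_counts, vocab


def classify(tokens, model):
    """Multinomial Naive Bayes with add-one smoothing, computed in log space."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / n_docs)
        for token in tokens:
            if token in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][token] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


posts = [
    ["great", "phone", "battery"], ["good", "battery"],
    ["awful", "phone", "screen"], ["bad", "screen"],
    ["battery", "lasts"], ["screen", "cracked"],  # contain no lexicon words
]

# Step 1: the lexicon labels what it can; these posts become the training set.
labelled = [(post, lexicon_label(post)) for post in posts if lexicon_label(post)]
# Step 2: the classifier handles the posts the lexicon left undecided.
undecided = [post for post in posts if lexicon_label(post) is None]
model = train(labelled)
predictions = [classify(post, model) for post in undecided]
print(predictions)
```

The classifier can label the undecided posts only through words that co-occur with lexicon
words in the training set, which is why errors made by the dictionary approach carry over
into the trained model.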

5 Testing and evaluation of the application

As test data we chose posts evaluated in advance by an expert. The posts come from
discussions on various topics, from politics to reviews of consumer electronics. The
whole corpus consists of 5242 posts. Of these, 4191 were evaluated using the dictionary
approach. The results of the dictionary-based analysis were then used as the training
set for the NBK. The remaining 1051 posts, which the dictionary approach could not
evaluate, were used as the test set for the NBK. Of these 1051 posts, 828 were negative
and 223 positive. The precision and recall results are shown in Tab. 1.


Tab 3. Precision and recall of the Naive Bayes classifier used within the hybrid approach.

                   Precision (%)   Recall (%)
positive posts         29.31          61.69
negative posts         85.94          61.17


The application achieved an average precision of 57.63% and an average recall of 61.43%. The
F-measure was 0.39 for positive posts and 0.71 for negative posts. The results may have been
affected by several factors. The first may be the imperfect classification of the training
set, which was created by the dictionary approach. Another factor may be the imbalance of
posts in the test set.
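
The reported averages and F-measures can be recomputed from the per-class precision and
recall. The sketch below assumes that the averages are unweighted means over the two classes
and that the F-measure is the usual harmonic mean; the variable names are illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall in percent, copied from the table of results.
results = {"positive": (29.31, 61.69), "negative": (85.94, 61.17)}

avg_precision = sum(p for p, _ in results.values()) / len(results)
avg_recall = sum(r for _, r in results.values()) / len(results)
f_scores = {label: f_measure(p, r) / 100 for label, (p, r) in results.items()}
```

Under these assumptions the averages come out as 57.625 and 61.43, and the F-measures as
roughly 0.397 and 0.715, so the reported 0.39 appears to be truncated rather than rounded.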

6 Conclusion

This paper described a hybrid approach to sentiment analysis. We created a model that used
the dictionary approach to build a training set for a machine learning method. Of the machine
learning methods, the Naive Bayes classifier was chosen; after training it was applied to the
posts that the dictionary approach could not evaluate. The average results of about 57%
(precision) and 61% (recall) did not meet our expectations, but they could be improved by
filtering the posts used for training or by balancing the test set.

Acknowledgement. This paper was supported by the Scientific Grant Agency of the Ministry of
Education, Science, Research and Sport of the Slovak Republic under project No. 1/0493/16
"Methods and models for data stream analysis".

Annotation:

Hybrid approach for opinion classification

New text data appear on the Internet every day. These data contain a lot of interesting
information that can be useful to other people and to companies dealing with sales and
marketing. When the amount of such data is huge, it is useful to analyze them automatically.
In our paper we focus on a combination of two approaches to sentiment analysis: the
dictionary approach and the Naive Bayes classifier. The combination addresses two problems.
First, the dictionary approach cannot analyze comments that contain no word from the
dictionary. Second, the Naive Bayes classifier needs a training dataset, which can be
difficult to obtain, especially for a new domain. The described approach achieved an average
precision of 57.63%.
Abstract. The paper presents a new corpus of TEDxSK and JumpSK motivational talks in Slovak.
The speech database consists of 220 talks with a total duration of 58 hours. The annotated
set of speech recordings was created automatically, without supervision, using acoustic
speech segmentation based on principal component analysis and automatic transcription by two
complementary speech recognition systems. To evaluate the quality of the automatic
speech-to-text transcription, an evaluation set of 50 talks lasting 12 hours was created and
additionally transcribed by hand. By automatic annotation we obtained 21.26% of new speech
data, relative to the total duration of the speech corpus, at a 9.44% error rate of the
automatic transcription; these data are suitable for retraining or adapting the original
acoustic model.

Contribution type: Work-in-progress paper

Keywords: automatic speech recognition, automatic annotation, speech corpus, acoustic
modeling.


1 Introduction

Developing ever more accurate systems for automatic speech-to-text transcription requires an
enormous amount of data for estimating the statistical parameters of speech and language,
covering as many of the phenomena that occur in spontaneous speech as possible. Building
robust acoustic models therefore calls for a phonetically rich and gender-balanced speech
corpus containing on the order of hundreds to thousands of hours of annotated speech
recordings. Creating such an amount of data manually by trained annotators would take a
disproportionate amount of time and money. Manual annotation takes on average eight to ten
times the duration of the speech recording [1]. Given some, even small, amount of manually
annotated speech data, it is now possible to use state-of-the-art approaches and methods to
build a complex system for automatic annotation and creation of new speech databases. These
could then be used, for example, to re-estimate the parameters of an acoustic model or to
adapt it to the voice characteristics of the speakers.





Problems in building large speech databases also arise on the side of the source-data
providers, whose consent is needed. For this reason, sources that are freely available to the
general public are sought. One such source is the database of talks from TED conferences
(short for technology, entertainment, design), which are organized all over the world and
promote so-called "ideas worth spreading".

Owing to the thematic diversity of the talks and the rich representation of speakers, they
have become a good basis for building speech databases in several languages. One of the best
known is TED-LIUM [2], which contains a total of 1495 automatically annotated TEDx talks in
English. The word error rate (WER) of the automatic transcription reaches 17.40% on average.
The transcription system is based on five-pass speech decoding with gradual adaptation of
the acoustic and language models and hypothesis rescoring. Among the most recent speech
databases is SI TEDx-UM [3], containing 242 automatically annotated TEDx talks in Slovenian.
In this case the WER of the automatic transcription was evaluated with the BNSI broadcast
news transcription system and reached as much as 50.70% on average.
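
For readers unfamiliar with the metric, the word error rate is commonly computed as the
word-level edit distance between the reference and the recognized text, divided by the number
of reference words. The following is a minimal sketch of that common definition; the function
name and the example strings are assumptions, and the evaluation tools used in the cited
works may normalize text differently.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("a b c d", "a x c")` counts one substitution and one deletion
against four reference words.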

This paper presents a new speech database of TEDxSK and JumpSK talks, annotated automatically
without supervision by two complementary Slovak speech recognition systems, with filtering
that keeps the hypotheses with a minimal number of errors. The database of transcripts for
the speech recordings will be released to the general public by the end of 2016 on the
project website of the Laboratory of Speech and Mobile Technologies (footnote 1).

2 Structure of the speech corpus

Our goal was to build an automatically annotated speech database containing recordings of
the best possible quality with one or two speakers per talk. The source data were obtained
from the TEDx Talks (footnote 2) and Jump Slovensko (footnote 3) channels via the YouTube
service. From a list of approximately 300 motivational talks from 10 events published
between 2010 and 2016, all foreign-language talks and low-quality recordings were removed
manually. From the total, 220 talks in Slovak with a total duration of approximately 58 hours
were selected. All audiovisual recordings were downloaded in the H.264 format. The audio
track was encoded in the MPEG AAC format. The compressed audio was converted to WAV (16-bit
PCM, mono) with the SoX tool (footnote 4). All audio files were downsampled to 16 kHz for
compatibility with the automatic speech recognition system. The speech corpus covers 227
unique speakers in total, of whom 154 are men and 73 are women. Female voices account for
approximately 30% of the total duration of the new speech database. A detailed overview of
the individual categories in the new corpus of 220 TEDxSK and JumpSK motivational talks is
given in Tab. 1.
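
The conversion step could be scripted roughly as follows. This is a sketch under assumptions:
the file names are hypothetical, the exact SoX invocation used by the authors is not given in
the paper, and the local SoX build must be able to read the source audio format.

```python
import subprocess


def sox_command(src, dst, rate=16000):
    """Build a SoX call that writes 16-bit signed PCM, mono, at `rate` Hz."""
    # Format options placed before the output file name apply to the output.
    return ["sox", src, "-r", str(rate), "-c", "1", "-b", "16",
            "-e", "signed-integer", dst]


def convert(src, dst, rate=16000):
    """Run SoX and raise CalledProcessError if the conversion fails."""
    subprocess.run(sox_command(src, dst, rate), check=True)
```

A call such as `convert("talk.m4a", "talk.wav")` would then produce a 16 kHz mono WAV file,
provided SoX is installed and supports the input.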


1 http://nlp.web.tuke.sk 

2 https://www.youtube.com/user/TEDxTalks 

3 https://www.youtube.com/user/jumpslovensko 

4 http://sox.sourceforge.net/

Tab. 1 Structure of the speech corpus of TEDxSK and JumpSK motivational talks.

event                  talks  speakers  of which  of which  total      of which   of which
                                        men       women     duration   men        women
TEDx Bratislava           57        61        42        19  13:03:55   09:02:35   04:01:20
TEDx Kežmarok              9        10         6         4  02:48:06   01:59:18   00:48:48
TEDx Košice               30        30        24         6  08:50:03   07:24:35   01:25:28
TEDx Nitra                14        14        12         2  04:13:37   03:33:07   00:40:30
TEDx Prešov               17        17        11         6  05:57:31   04:07:32   01:49:59
TEDx Trenčín              24        25        14        11  05:50:43   03:36:40   02:14:03
TEDx Trnava                9         9         6         3  02:21:53   01:42:20   00:39:33
TEDxYouth Bratislava      20        20        15         5  05:36:39   04:06:05   01:30:24
TEDxYouth Žilina           6         6         4         2  01:41:34   01:06:59   00:34:35
Jump Slovensko            34        35        20        15  07:27:35   04:12:44   03:14:51
TOTAL                    220       227       154        73  57:51:36   40:51:55   16:59:41


3 Automatic segmentation and annotation of the speech corpus

The principle of automatic segmentation and annotation of the corpus of talks using two
complementary speech recognition systems is shown in Fig. 1.

Automatic speech segmentation is based on segmental principal component analysis (PCA)
applied to the time samples of a micro-segment of speech, for which the eigenvalues are
computed and analyzed. The eigenvalues determine the character of the segment (speech
activity or silence). Gradual smoothing and stack-based accumulation of the elementary
segments then produce continuous speech segments without silent parts, according to a
predefined parameter configuration of our acoustic speech segmenter [4].
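
The smoothing and accumulation step can be illustrated by a small sketch. The PCA-based
speech/silence decision itself is not reproduced here: the code assumes that a binary label
per micro-segment is already available, and the majority-vote window, the 10 ms step and the
function names are assumptions rather than the configuration of the segmenter from [4].

```python
def smooth(labels, window=5):
    """Majority-vote smoothing of binary speech (1) / silence (0) labels."""
    half = window // 2
    out = []
    for i in range(len(labels)):
        chunk = labels[max(0, i - half): i + half + 1]
        out.append(1 if 2 * sum(chunk) > len(chunk) else 0)
    return out


def merge_segments(labels, frame_ms=10):
    """Accumulate consecutive speech frames into (start_ms, end_ms) segments."""
    segments, start = [], None
    for i, label in enumerate(labels):
        if label and start is None:
            start = i                      # a speech run begins
        elif not label and start is not None:
            segments.append((start * frame_ms, i * frame_ms))
            start = None                   # the run ended at a silent frame
    if start is not None:
        segments.append((start * frame_ms, len(labels) * frame_ms))
    return segments


raw = [0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0]
speech = merge_segments(smooth(raw, window=3))
```

In this toy input the isolated silent frame inside the speech run is filled in and the
isolated speech frame near the end is dropped, leaving a single continuous segment.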

Inspired by [5], we designed and built a complex unsupervised system for automatic annotation
of large speech databases. It is based on the complementarity of two Slovak large vocabulary
continuous speech recognition (LVCSR) systems [1][4]. The complementarity lies in the use of
two different acoustic models. The first was trained on a database of annotated recordings of
dictated speech of 320 hours, the second on a database of annotated spontaneous speech of 330
hours. The trigram model of the Slovak language was restricted to a vocabulary of 500k unique
words, and decoding used the freely available LVCSR system Julius [6] with hypothesis
rescoring by a slightly modified ROVER algorithm [7].

After automatic transcription of the speech recordings, the output hypotheses of the two
recognition systems, LVCSR 1 and 2, are compared and filtered. In the next step the output
hypotheses are aligned. Besides the minimum number of matching words in the aligned
hypotheses, the alignment also takes into account the maximum time offset of words from the
beginning and end of the speech recording and the confidence measure score (CMS) of the
recognized words. The output consists of short, automatically annotated speech segments that
passed the filtering step [1].

Fig. 1. Automatic segmentation and speech-to-text transcription using two complementary
systems for automatic speech recognition in Slovak.

It was also found experimentally that the minimum time offset of words from the beginning
and end of the speech recording should be set to 20 ms and that an aligned hypothesis should
contain at least three matching words [1].
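
The word-agreement part of this filtering can be sketched as follows. The example uses
Python's difflib as a stand-in for the alignment and ignores the time offsets and confidence
scores, so it is only an approximation of the filter described in [1]; the function name and
the two hypotheses are made up for illustration.

```python
from difflib import SequenceMatcher


def agreed_segments(hyp1, hyp2, min_words=3):
    """Return word runs on which both hypotheses agree, at least min_words long."""
    a, b = hyp1.split(), hyp2.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [" ".join(a[m.a:m.a + m.size])
            for m in matcher.get_matching_blocks()
            if m.size >= min_words]


# Hypothetical outputs of the two recognizers for the same audio.
lvcsr1 = "dnes vam chcem povedat nieco o hudbe"
lvcsr2 = "dnes vam chcem poviem nieco o hudbe"
kept = agreed_segments(lvcsr1, lvcsr2)
```

Raising `min_words` makes the filter stricter: fewer segments survive, but those that do are
more likely to be transcribed correctly.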

Tab. 2 Amount of data obtained after automatic segmentation and transcription of the speech
corpus.

set          actual     duration after   setting No. 1   setting No. 2   setting No. 3
             duration   aut. segment.    ~13.57% WER     ~9.44% WER      ~4.94% WER

amount of obtained data [hh:mm:ss]
eval         12:26:07   11:50:37         05:39:30        02:47:35        00:39:43
dev          45:25:29   43:13:12         19:37:41        08:54:47        02:01:04
eval + dev   57:51:36   55:03:49         25:17:11        11:42:22        02:40:47

amount of obtained data [%]
eval         -          95.24            47.78           23.58           5.59
dev          -          95.15            45.41           20.62           4.67
eval + dev   -          95.17            45.92           21.26           4.87


In the first step of building the corpus of TEDxSK and JumpSK talks, we split the corpus into
two parts: an evaluation (eval) part and a development (dev) part. The evaluation part of 12
hours was additionally annotated by hand by trained annotators. On this set we evaluated the
performance of the automatic speech-to-text transcription in several different settings (see
Tab. 2, settings No. 1 to 3). These settings were chosen to obtain a certain volume of
annotated speech data with regard to quantity (setting No. 1) or quality (setting No. 3).
Setting No. 2 is a compromise between the quality and the quantity of the automatically
annotated speech data. The same settings were then used for the automatic annotation of the
remaining, development part of the corpus. The total amount of data obtained after automatic
annotation of the corpus is summarized in Tab. 2.

The table shows that at a transcription error rate of 13.57% WER we obtained approximately
45.92% of new annotated speech data, which can be used, for example, to re-estimate the
parameters of an existing acoustic model or to adapt it. Similarly, at 9.44% WER we obtained
21.26% of new data, and at 4.94% WER approximately 4.87% of the data out of the total of 58
hours.
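
The percentages in Tab. 2 can be reproduced from the listed durations. The sketch below
suggests that the shares for the three settings are taken relative to the duration after
automatic segmentation (about 55 hours), while the segmentation share is relative to the
actual duration; the helper names are assumptions.

```python
def seconds(hms):
    """Convert an hh:mm:ss string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return 3600 * h + 60 * m + s


def share(part, whole):
    """Percentage of `whole` covered by `part`, rounded to two decimals."""
    return round(100 * seconds(part) / seconds(whole), 2)


segmented = share("55:03:49", "57:51:36")   # duration kept by segmentation
setting_2 = share("11:42:22", "55:03:49")   # data obtained at ~9.44% WER
```

Relative to the unsegmented 57:51:36, the same 11:42:22 of data would correspond to a
slightly lower share of about 20%.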

4 Conclusion

This paper briefly introduced the newly created corpus of TEDxSK and JumpSK talks. The
annotated set of 220 speech recordings was created automatically, without supervision, by a
system for automatic annotation and creation of speech databases that is based on the
complementarity of two systems for continuous speech recognition in Slovak. The database of
transcripts for the TEDxSK and JumpSK recordings will be released to the general public by
the end of 2016 on the project website.

Acknowledgement: This research was supported by the Cultural and Educational Grant Agency
under contract No. 055TUKE-4/2016, by the Slovak Research and Development Agency under
contract No. SK-HU-2013-0015, and through the research project APVV-15-0517, funded by the
Ministry of Education, Science, Research and Sport of the Slovak Republic.

Annotation:

Automatic annotation and building of a speech corpus of TEDxSK and JumpSK talks

The paper presents a new Slovak spoken language resource built from TEDxSK and JumpSK
lectures. The speech database consists of 220 lectures with a total duration of 58 hours. The
annotated speech corpus was generated automatically, in an unsupervised manner, using
acoustic speech segmentation based on principal component analysis and automatic speech
transcription by two complementary speech recognition systems. To evaluate the quality of the
automatic transcription, an evaluation set of 50 lectures with a total duration of 12 hours
and manual transcription was created. Using automatic annotation of the TEDxSK and JumpSK
lectures, we obtained 21.26% of new speech data with a 9.44% word error rate, suitable for
retraining or adaptation of the original acoustic model.
Abstract. In this work we describe otazkovac, a simple (web or command line) application
capable of generating questions from specific types of Slovak sentences, provided it is fed
with appropriate data. It is intended to be used as a submodule of Multimedialna Citanka, a
web application that helps children in the first three grades of Slovak primary schools learn
to read with the help of a computer. It uses a set of methods known to produce
state-of-the-art results in question generation problems on English text corpora. As part of
this work we present a dataset based on stories from Multimedialna Citanka that can be used
in further research on question generation from Slovak texts.

Contribution type: Application paper 

Keywords: question generation, part of speech tagging, Slovak text corpora 


1 Introduction 

For the purpose of this work we use the definition of Question Generation from [1], where it
is defined as "the task of automatically generating questions from some form of input. The
input could vary from information in a database to a deep semantic representation to raw
text. Question Generation is viewed as a three-step process: content selection, selection of
question type and question construction."

The application presented in this work, otazkovac, tries to solve this task in the specific
context of Multimedialna Citanka [2], a web application that helps children improve their
reading skills by analyzing a recording of their speech in real time and giving them feedback
on how accurately and how fast they were able to reproduce a given text (usually a short
story). It also attempts to assess their comprehension skills by asking questions related to
the presented text. Since creating these questions by hand is a laborious task (footnote 1),
a need for an automated solution arose. The aim of otazkovac is to fulfill this need.


1 Especially considering the number of texts in the database of Multimedialna Citanka and the
fact that new texts are continuously being added.

While many complex and involved methods for Question Generation in educational contexts exist
[3], given the intended use case of otazkovac it seems that a simple procedure of finding an
appropriate sentence, detecting its type from the first few words and then syntactically
transforming it into a question might be sufficient. A more detailed description of this
procedure follows in the subsequent sections.

2 Question Generation 

For otazkovac to find sentences that could potentially be turned into questions, two stages
are required: splitting the text into sentences and detecting whether a sentence starts with
a preposition. Both tasks can be performed by MorphoDiTa: Morphological Dictionary and Tagger
[4], provided that it is given a pre-trained language model as input. We were provided with
such a model by the Slovak National Corpus. While most publicly available Part of Speech
(POS) taggers use the Penn Treebank POS tags, the Slovak National Corpus uses a specific set
of tags [5] that reflects the nature of the Slovak language and provides more morphological
information (footnote 2). Thanks to these tags, sentences that start with prepositions can
easily be identified, as well as the other words that belong to the "prepositional" part of
the sentence.

An example of a tagged sentence looks as follows: 


E-6-u--          Po
AAis6-x--        dobrom
SSis6--          kúpeli
R--              sa
V-ms-cL-A-d--    rozlúčil
E-7-u--          s
SSis7--          mesiačikom
O--              a
SSfp7--          hviezdičkami
O--              a
V-ms-cL-A-d--    ľahol
R--              si
V-I-A-e--        spať
Z-               .


As we can see in the example, the first word's tag starts with E, which means that it is
tagged as a preposition. Note that the tags of the next two words contain the same number (6)
as the tag of the first word. This number marks the case of a word, and the tagger considers
all three words to be in the 6th Slovak case, the locative. These three words can then be
replaced with Kedy and the full stop with a question mark; the result is a sentence that
might constitute a fairly good question:


2 Note that the tags discussed below come directly from the MorphoDiTa model and are slightly
different from those described in [5].

Po dobrom kúpeli sa rozlúčil s mesiačikom a hviezdičkami a ľahol si spať.

then becomes

Kedy sa rozlúčil s mesiačikom a hviezdičkami a ľahol si spať?
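
The transformation described above can be sketched in a few lines. This is not the code of
otazkovac; the function names are hypothetical, the input is assumed to be a list of (tag,
word) pairs, and the case is read as the first digit in a tag, which holds for the tags
printed in the example but is an assumption about the tagset in general.

```python
def case_of(tag):
    """Return the first digit found in a tag (assumed to encode the case)."""
    for ch in tag:
        if ch.isdigit():
            return ch
    return None


def to_question(tagged, question_word="Kedy"):
    """Replace a leading prepositional phrase with a question word."""
    if not tagged or not tagged[0][0].startswith("E"):
        return None  # the sentence does not start with a preposition
    case = case_of(tagged[0][0])
    end = 1
    # Extend the phrase over adjectives (A) and nouns (S) in the same case.
    while (end < len(tagged) and tagged[end][0][0] in "AS"
           and case_of(tagged[end][0]) == case):
        end += 1
    words = [question_word] + [word for _, word in tagged[end:]]
    if words[-1] == ".":
        words.pop()  # the full stop is replaced by a question mark
    return " ".join(words) + "?"


sentence = [
    ("E-6-u--", "Po"), ("AAis6-x--", "dobrom"), ("SSis6--", "kúpeli"),
    ("R--", "sa"), ("V-ms-cL-A-d--", "rozlúčil"), ("E-7-u--", "s"),
    ("SSis7--", "mesiačikom"), ("O--", "a"), ("SSfp7--", "hviezdičkami"),
    ("O--", "a"), ("V-ms-cL-A-d--", "ľahol"), ("R--", "si"),
    ("V-I-A-e--", "spať"), ("Z-", "."),
]
question = to_question(sentence)
```

Choosing the question word (Kedy, Kde, or none at all) is a separate decision that the sketch
leaves to the caller through the `question_word` parameter.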


3 Question Type Detection 


Let us consider another example of a sentence with tags outlined for each word: 


E-6-u-           V
SSfs6-           chate
V--p-aK-A-e-     sme
R-               sa
V-hp-aL-A-d-     stretli
E-7-u-           s
AAms7-x-         ďalším
SSms7-           poľovníkom
Z-               .


As we can see, the first two words in this example are of a type similar to the first three
words in the first example. However, replacing them with Kedy does not seem like an option:
in Slovak, if the preposition v is followed by an object and this object is not itself a time
reference (such as a weekday or the name of a month), this "prepositional clause" is most
probably associated with a place, not a time. Therefore Kde would be far more appropriate in
this case than Kedy.

Just from these two examples it is obvious that in order for otazkovac to create correct and
relevant questions, it needs to be able to detect what type of question can be generated
from a given sentence (if any). To do so, we gathered a dataset of sentences that could
possibly be transformed into questions as described above from all stories available on
Multimedialna Citanka. MorphoDiTa models also provide the lemma along with a tag, so we
included this information in the dataset in order to make the detection more robust to
variation in natural language.

The dataset consists of 695 tagged sentences. Unfortunately, the premise from above does not
hold in general (owing to variability in natural language), and the dataset contains
sentences like "O zvieratách sa dočítam v encyklopédii Svet zvierat" that fall in neither the
Kde (marked P for Place in the dataset) nor the Kedy (marked T for Time in the dataset)
category. These sentences should be ignored and are therefore marked I in the dataset. The
final dataset contains 431 sentences in the Place category, 240 sentences in the Time
category and 24 sentences in the Invalid category.


3.1 Feature Engineering 

In order to use a machine learning algorithm for detection of the question type, it is
necessary to represent its inputs as a set of features. A natural choice of features in a
scenario like this would be n-grams over the list of lemmatized words. A slightly better
alternative might be to treat POS tags as words too. The motivation behind this decision is
that, for instance, a sentence of type P is more likely to have the preposition na followed
by some sort of noun represented by a tag such as SSis6 rather than by a specific noun
itself. Since our dataset is very small, this setup should help us capture more variability
in the data. One last improvement that might help even more would be the addition of bigrams
concatenated from the beginning and the end of the list, so that the words "v poslednej
zákrute" would be represented by a feature vector similar to that of "v zákrute", since the
middle word does not change the type.
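
One possible reading of these three feature types is sketched below. It is an illustration
rather than the feature extraction of otazkovac: the function names, the way lemmas and tags
are mixed, and the example tags are all assumptions.

```python
def ngrams(tokens, n_min, n_max):
    """All n-grams of orders n_min..n_max, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def features(lemmas, tags, n_min=2, n_max=3):
    """Lemma n-grams, tag-as-word n-grams and a first+last word bigram."""
    feats = ngrams(lemmas, n_min, n_max)       # plain lemmatized words
    mixed = lemmas[:1] + tags[1:]              # preposition lemma, then POS tags
    feats += ngrams(mixed, n_min, n_max)       # POS tags treated as words
    if len(lemmas) >= 2:
        feats.append(lemmas[0] + " " + lemmas[-1])  # bigram skipping the middle
    return feats


long_phrase = features(["v", "posledný", "zákruta"], ["E-6-u-", "AAfs6-x-", "SSfs6-"])
short_phrase = features(["v", "zákruta"], ["E-6-u-", "SSfs6-"])
```

Both phrases now share the feature "v zákruta", which is the effect the concatenated bigram is
meant to achieve.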

3.2 Model Selection 

There are multiple models to choose from when it comes to text classification. We might use a
multinomial Naive Bayes classifier (NB) as a baseline, a random forest classifier (RF) as an
example of a model that tends not to overfit, and an SVM, which is one of the recommended
models for text classification on small datasets. All of the models were tested in
combination with the features described above using 10-fold cross validation. The results are
provided below:

Tab 4. The resulting testing accuracies of tested combinations of models and features

                 NB        RF        SVM
2-3 normal     87.36%    83.90%    85.90%
2-4 normal     88.21%    85.76%    86.19%
1-4 special    89.49%    85.62%    89.06%
2-4 revers     89.49%    87.06%    89.06%
2-3 revers     89.49%    88.78%    90.64%


The numbers are the orders of the n-grams which were used (2-3 means that bigrams and
trigrams were used), normal means the setup with just lemmatized words, special is the setup
described in the last paragraph of the section above and revers is the setup in which POS
tags are treated as words. As it turns out, our special handcrafted features are at best the
same as POS tags with 4-grams. When we train the best model on the whole dataset we get an
accuracy of 98.27 percent.
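
The evaluation protocol can be sketched without any particular classifier. The code below is
a generic 10-fold cross-validation loop; the `fit` and `predict` callables, the fixed random
seed and the unstratified split are assumptions, since the paper does not describe the exact
tooling.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def cross_val_accuracy(fit, predict, X, y, k=10):
    """Mean test accuracy over k folds for user-supplied fit/predict callables."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)
```

As a point of reference, a classifier that always predicts the most frequent class (Place,
431 of 695 sentences) would reach roughly 62% under this protocol.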

3.3 Error Analysis

It might be interesting to see in which cases the model failed to predict the correct class.
Let us consider the following sentence:

Na konci mesta si našiel lietajúcu motorku a ukradol ju.

Unfortunately, in this case the MorphoDiTa model decided that the third word 'mesta' used a
different case than the two before it. This is not true, but given our premise described
above (see section 2) only the first two lemmatized words were treated as features instead of
the first three, which greatly affected the result.

Another example of a mistake made by the model can be seen in the following sentence:

Na Havaj sa teším.

In this case the MorphoDiTa model thinks that Havaj is an abbreviation or a special entity of
sorts, which is a situation our premise is not ready for. However, we can also see that in
this case a completely different type of question could be generated (namely one using Kam).
It also needs to be noted that given the simplistic nature of the problem at hand (only
questions starting with Kedy and Kde are generated), many sentences (such as Z chaty vyšla
mačka) will not be considered. This shows that there is potential for future improvement.

4 Conclusions and Future Work 

We present an implementation of a simple method for generating specific questions from
unstructured Slovak text. This method incorporates POS tagging as well as supervised learning
of "question types" for sentences that start with a preposition. Although it is still a work
in progress, preliminary tests show that the questions it currently generates are sufficient
in the context of Multimedialna Citanka.

While this work focuses on just one possible way of generating questions through simple
sentence transformation (footnote 3), the approach might be reusable in other contexts. With
an appropriate training dataset and a MorphoDiTa model, it can also be used for another
language. We would also like to note that this project is licensed under the GNU GPL license
and can be obtained along with the aforementioned dataset from
https://github.com/mrshu/otazkovac

Acknowledgment: We would like to thank the Ľ. Štúr Institute of Linguistics, Slovak Academy
of Sciences, for providing us with a MorphoDiTa model of the Slovak language.

Abstract. Within the starting project Human Information Behavior in the Digital Space we deal
with the interpretation of records of user activity in digital spaces. The project aims at a
deeper understanding of behavior and at creating methods and models for its automatic
interpretation. We consider traditional sources of feedback as well as inputs from sensors
that have so far been little used, such as the eye trackers at the User Experience and
Interaction Research Center at FIIT STU (uxi@fiit, http://uxi.sk).

Contribution type: Work-in-progress paper

Keywords: user behavior, data analysis, eye tracking


1 Motivation: understanding human interaction with applications

Knowing how a person behaves in the digital space is important for the success of every
application that person uses. Behavior, that is, the sequence of the user's actions in an
application, carries together with its context a lot of information about how the user uses
the application, whether they reach their goals in it, whether it is pleasant to use, and so
on.

If we can identify (classify) behavior automatically, the application can react to it (for
example, prevent the user from leaving the application or recommend suitable content).
Automatic behavior detection also matters ex post: it facilitates application analytics,
which is useful in analyzing usability, marketing strategies, etc. [2].

Raw data capturing behavior are, however, low-level and need to be interpreted. Elementary
user actions found in the logs only rarely speak directly about the motives, goals or
feelings of users. Yet interpreting usage logs into a symbolic representation that would
explicitly name the types of user behavior is a hard problem. Some approaches try to bypass
this interpretation by driving the effects of adaptation through machine learning (e.g.
predicting the user's departure from an application [1]). Such approaches, however, often
depend too much on a particular domain and require large samples of training data.

We address the interpretation and deeper understanding of human behavior in the digital space
in the starting basic research project HIBER. We develop the research primarily on the
established infrastructure of the User Experience and Interaction Research Center, in the
UXI@FIIT laboratories, where we have several sensor technologies that enable detailed
recording of the elementary actions of a person working with a computer and thus also moving
in the digital space. The project will build on several existing studies focused on detecting
behavior patterns (attempts at deception and fraudulent behavior) and typical user states
(emotional arousal, cognitive load).

2 Previous research on user behavior at UXI@FIIT

UXI@FIIT has been operating at the faculty for more than a year. We focus on automated
usability evaluation and on support for user studies (data visualizations, annotation tools).
At the same time, we devote a considerable part of our effort to research on methods for
automatically detecting the state of users. Besides analyzing "traditional" sources of
feedback (mainly sequences of application usage logs), we analyze data from the specialized
sensors the laboratory is equipped with (mainly eye tracking, and also recordings from depth
cameras, EEG and sensors of human physiology). In several smaller projects we explored the
possibilities of detecting general (as domain-independent as possible) states in which users
may find themselves and knowledge of which can contribute considerably to a suitable
adaptation of applications.

Estimating emotional arousal and cognitive load by measuring pupil dilation. The pupil
diameter changes when emotional arousal or cognitive load increases [3]. Using eye tracking
to detect these states is complicated by the reactivity of the pupil to lighting conditions.
These also change with changes on the computer screen or with a change of the gaze target. We
designed and verified a personalized model that predicts pupil changes caused by changes in
lighting conditions. It takes into account the place on the screen the user is looking at,
computes the luminance perceived by the user, and on that basis predicts the change in pupil
diameter. The first results were presented at the UMAP 2016 conference [5].

Measuring visual search ability. One strategy for explaining user behavior is to model the
abilities of users. For example, visual search ability (the ability to locate an element or a
piece of information in an interface by sight) universally affects the performance of users
when carrying out tasks in applications, especially on the Web. We created a visual search
test in which we determine the level of this ability with the help of eye tracking. The test
is built on solving artificial visual search tasks as well as tasks in real interfaces of web
pages. Besides the reaction time metric, the test uses various eye tracking metrics, mainly
the number of fixations while solving a task.

Detecting deception when filling in questionnaires. Detecting dishonest user behavior
matters in many scenarios; such behavior is frequent when filling in questionnaires. We
carried out experiments in which users filled in the Big Five personality questionnaire,
once with the task of answering truthfully and a second time so as to appeal as much as
possible to a potential employer. We recorded the whole process of filling in the
questionnaire, including eye tracking. Several metrics, such as the time of the first
fixation on an answer and the change in pupil diameter, proved indicative for
distinguishing dishonest user behavior.

Comparing sessions within user studies. In contrast to the previous research, we also
worked on a "bottom-up" approach, which tries to give sequences of user actions meaning
through clustering (searching for recurring behavior patterns). The approach uses
unsupervised learning, specifically a restricted Boltzmann machine. Fragments of sessions
are presented to this form of neural network as heat maps, in which the network can find
suitable abstractions. The abstractions are then visualized in a clear scheme and prepared
for further manual inspection [6].

3 Challenges in the HIBER project

The starting project focuses on research into new models and methods for acquiring and
processing information that is important for a better understanding of human information
behavior in the digital space. These models and methods open room for making human
activity in the digital space more efficient, mainly by mitigating the consequences of
cognitive information overload in large digital spaces. An example is recommending
suitable information sources based on automatic prediction of human information behavior
in a specific domain [4].

The challenges we address in the proposed project include:

1. Limits of quantitative reasoning over indicators of implicit feedback. The trend in
existing methods is to analyze people's information behavior on the basis of easily
measurable signals while ignoring many other factors that may influence behavior. One
possible direction is to involve deeper, qualitative methods of studying human
information behavior.

2. Possibilities of involving new signals of implicit feedback. "Traditional" indicators
such as mouse clicks, queries or scrolling are today joined by indicators that have not
yet been properly explored, such as eye tracking or physiological measures.

3. Understanding user behavior. In user modeling, attention is focused mainly on the
user's goals and on the content of digital spaces; however, it is also necessary to
understand users' behavior in the process of achieving those goals.

4. Scalability. All methods and models also need to be scalable, and thus adapted to the
principles of big data and distributed computing.

The main goal of the project is to bring new knowledge in informatics and information
technologies, in particular to:

• Investigate new phenomena of human information behavior and bring new knowledge
connected with human behavior in digital spaces, in the context of various situations and
types of devices for collecting and providing information;

• Investigate new models capturing people's behavior in digital spaces;

• Propose and verify new methods for acquiring and analyzing implicit feedback, for
predicting human information behavior and for using it to make people's activities in
digital spaces more efficient (mainly through personalized navigation and various forms
of visualization of the digital space).

Together with partners from the Faculty of Arts of Comenius University in Bratislava, we
emphasize interdisciplinary data analysis (using quantitative and qualitative approaches
from informatics, information science and psychology). On the technological side, we focus
on new feedback sources and on technologies for processing (streams of) big data, which are
needed to process the ever-growing amount of raw data flowing from feedback sources. The
scope of the project is fairly broad and extends beyond a single field of knowledge.

With the starting project we want to contribute to methods and models for a deeper
understanding of user behavior in digital spaces and for its automatic interpretation. We
thus want to build on a long research tradition in behavior analysis, user modeling and
personalization. The research will also draw on the equipment and personnel infrastructure
of the laboratories of the Center of User Experience and Interaction, and will build on the
behavior analysis projects already under way there.

Acknowledgement: This article was created thanks to partial support from the Slovak
Research and Development Agency under project APVV-15-0508.
deeper understanding of user behavior as well as research into new methods and models of
automated behavior interpretation. In this research, we especially take into account new
sensor-based sources of implicit feedback as well as traditional ones.
Abstract. Gamification still has untapped potential in the library environment.
The reason is that applying gamification successfully is not an easy task.
Instead of imitating applications consisting of points, leaderboards or
badges, it is important to reflect the specific user motivations in the specific
environment. With this aim, the empathy mapping method was applied in the design
process; it helped mainly in discovering the preferences of our target group
in the fields of reading and gaming. The knowledge gained by
this method was translated into the processes of a web application using
user experience methods. We believe that these methods can work as a bridge for
communication between information science and computer science professionals
and will help to accomplish the idea of a successful gamification application
in libraries.

Contribution type: Work-in-progress paper 

Keywords: library gamification, user experience methods, empathy map, customer journey


1 Introduction 

A simple definition of gamification is "the application of game elements and game
thinking in a non-game environment" [7]. In a for-profit environment, gamification is
often a successful way to attract new customers to services in a playful manner. A
well-designed application of gamification may therefore also help to promote libraries
and reading. This research is unique, as gamification in libraries is a new domain
[5], [6] and our user experience methods are also new to this area.

There are various methods of user testing in the field of user experience: eye tracking
or mouse tracking combined with think-aloud methods, card sorting and wireframing,
logs and web analytics, social media analysis, competitive intelligence, interviews,
ethnography and surveys. The first six methods work well when testing interfaces or
prototypes of services that already exist and the stakeholder's aim is just to improve
the existing solutions. The disadvantage of ethnographic research is that it introduces
a lot of bias and is time-consuming. Interviews and surveys are often not successful in
finding innovative solutions, as users often do not know what they need. It was
necessary to think beyond these methods to create a new, useful and usable solution.

The foundation of a good user experience is knowledge about the user, their problems,
goals and needs. A deep understanding of users' behavior is crucial, and since emotions
influence behavior, an understanding of users' emotions is important as well.

University students were set as the main target group of users of the gamification
application. The reason is that they still have time and potential for reading, and many
players can be found in this group. Creating personas of our target group was the
next step, as designers using personas created 80 percent more ergonomic designs than
designers that did not use them [4].

2 Empathy map 

A deep understanding of user behavior requires empathy, i.e. identification with the
feelings, thoughts, or attitudes of another. For this reason, the empathy mapping
method was chosen for our user research. An empathy map is a tool that helps to
synthesize observations about users and to draw out unexpected insights. Empathy maps
vary in shape and size, but there are basic elements common to each one [1]:

• Four quadrants broken into “Thinking,” “Seeing,” “Doing,” and “Feeling.” 

• Sticky notes covering the quadrants (different color for different user) 

• Additional boxes at the bottom of the quadrants: “Pains” and “Gains” (in some versions)

The first step in the process is the summarization of the researcher's field notes from
user observations (sketches, audio/video files and photos) [1]. Instead of organizing the
data in the quadrants ourselves, we brainstormed directly with users from our target group
about their “Thinking,” “Seeing,” “Doing,” “Feeling” and “Pains”, both in reading and in
playing games. Eight users who play and read on a regular basis were selected for this
qualitative user research. They were asked to do the exercise individually during the
session: to think about their favorite games and books, write their thoughts down on
sticky notes and put them into the appropriate quadrants of the empathy map. The most
important part of each statement was the "because" part, in which a moderator asked them
to explain their thoughts.

After a team brainstorming, specific themes and key concepts started to emerge
and the primary needs of the users were identified. These themes were then sorted into
categories and visually organized into a mind map during the brainstorming session. This
approach formed a cohesive vision of the future user experience, which was afterwards
visualized in the form of a customer journey.
According to the empathy mapping results, the most important thing for both gamers and
readers was the feeling of identification with the character and immersion in the story.
Young readers are used to connecting and comparing the information they read with their
reality, and the text makes them think about values. In games, some more active
motivations were additionally mentioned, such as the possibility to evolve, discover,
cooperate or kill enemies. Both in reading and in playing, the feeling of thrill and
conflict has to be present. Senselessness (a missing objective), bad graphics, difficult
text and clichés should be avoided. On the contrary, the application should surprise
users with the ability to build or to see something new. Gamers need to know the
final objective so that they can strive to reach it and feel the adrenaline all the way.

3 Customer journey 

A customer journey map is a visual interpretation of the overall story, from an
individual's perspective, of their relationship with a product, service or organization
over time and possibly across channels [2]. It makes it possible to envisage interactions
from the users' points of view instead of taking an inside-out approach. Journeys can be
used to evaluate both current and future products or prototypes. They are useful for
examining the present points of delight and pain points and for uncovering opportunities
to build a better user experience, with the aim of fitting the products or services into
the users' lives. The basic components of a customer journey are [2]:

• Timeline: a finite amount of time or variable phases 

• Emotions: peaks and valleys illustrating emotions; or pain points and points of delight
of user experience

• Touchpoints: customer actions and interactions with the application 

There are various forms of customer journeys, of which the left-to-right timeline is
the most used. Other variations are circular or helical maps, which can be supplemented
by pictures or multimedia. Some templates are available online; in our research, the
Game template by Uxpressia¹ was adjusted to our needs.

4 Translating user needs to application attributes 

A dominant need identified during the empathy mapping session was immersion in the
story. According to Kelway [3], too, the feeling of total immersion from the
interaction with a product is particularly prevalent in the gaming world. It results in
feelings of joy, satisfaction and escapism from reality. In games it may manifest itself
in challenges that can be overcome. Sensory experiences that are in balance with
cognitive engagement seem to provide the best experience [3].

The answer to the question of how exactly a user can be enveloped by the gamification
application was found in further answers of the respondents. Some of them mentioned
a conflict in a game or in a fiction story, others mentioned the ability to build
something or to destroy the enemy. An example of translating one user need (conflict)
together with the solution is depicted in the customer journey (Table 1).

¹ https://uxpressia.com/


Tab 1. An example of translating a user need into a system feature using the adjusted game
template of the customer journey (the "use" phase of the timeline)

                    USE
USER NEEDS          Conflict
USER EXPERIENCE     Development / weakening of soul force in fights
TOUCHPOINTS         Fight on battleground
PROCESS             Players can invite the others to a duel, if strong enough.
                    Players with the same amount of points from different groups will
                    meet on the battleground.
PAIN POINTS         Death (player cannot be killed, just weakened)
PROBLEMS            Too easy / difficult questions
POINTS OF DELIGHT   Winning = destroying the force of a competitor, the rise of the
                    winner's force of spirit


5 Conclusion 

The methods used in our gamification research, which shift the paradigm from a
self-centered to a user-centered approach, were briefly explained. Involving user
preferences and problems in the first phases of information system design is crucial for
a successful application. The gamification application will be created in cooperation
between the Department of Library and Information Science (Faculty of Arts, Comenius
University in Bratislava) and the Faculty of Informatics and Information Technologies
(Slovak University of Technology in Bratislava), together with volunteers (book reviewers
and graphic designers), as a crowdsourcing project. The partners considered our methods a
helpful communication bridge for writing the specification of the system and for its
development. Still, further elaboration in the form of detailed wireframes is needed and
is currently in progress. The resulting prototype will be evaluated with our target
group and then implemented. The gamification application is planned to be deployed
in Slovak libraries as a part of a particular library and information system. We believe
that cooperation between the social and computer sciences will be successful through the
combination of their two different approaches.
We consider user-item preference represented by a rating and deal with content-based
recommendation. We present our results on preference learning (models, methods,
prototypes, data, metrics and experiments) already published in [1, 2, 3, 4] and add
some material presented at the EURO 2016 Preference Learning Stream [5].

We represent preferences on a set $O$ of objects by a rating (scoring) function
$r: O \to [0,1]$, which assigns to every object $o \in O$ its overall preference score
$r(o) \in [0,1]$. This score has a purely comparative interpretation: we say an object
$o_1$ is more preferred than an object $o_2$ if $r(o_1) > r(o_2)$. Further, we consider
$O \subset \prod D_i$, the data cube (we freely switch between $O$ and $\prod D_i$).

Assume further that we have a set of users $U$ and, for each user $u \in U$, the set
$V^u \subset O$ of objects visited by him/her and the corresponding observed rating
$r^u: V^u \to [0,1]$. In practice $V^u$ is much smaller than $O$. Here we deal with
offline testing (for online testing see [6]), and the visited objects are divided,
$V^u = V^u_{train} \cup V^u_{test}$, into a disjoint union of training and testing
examples (with repeated cross-validation). This implies that we also have
$r^u_{train}: V^u_{train} \to [0,1]$ and $r^u_{test}: V^u_{test} \to [0,1]$. Our task is
to find a recommendation in the form of a total rating $r^u_e: \prod D_i \to [0,1]$ such
that $r^u_e$ is a good estimation of $r^u_{test}$ (in the sense of some metric, distance
or order agreement, as $r^u_e$ induces an ordering $<_e$).
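As an illustration of what "a good estimation" can mean, the following minimal sketch compares an estimated rating with a test rating for a single user using RMSE and the ratio of order agreement, two of the measures named in this contribution. The function names, the dictionary representation of ratings and the sample values are assumptions made for this example only; they are not taken from the cited papers.

```python
from itertools import combinations
from math import sqrt


def rmse(r_test, r_est):
    """Root mean squared error over the objects rated in the test set."""
    errors = [(r_est[o] - r_test[o]) ** 2 for o in r_test]
    return sqrt(sum(errors) / len(errors))


def order_agreement(r_test, r_est):
    """Share of test pairs (with distinct test ratings) ordered the same way by the estimate."""
    agree = total = 0
    for a, b in combinations(r_test, 2):
        if r_test[a] == r_test[b]:
            continue  # a tie in the test rating carries no ordering information
        total += 1
        if (r_test[a] - r_test[b]) * (r_est[a] - r_est[b]) > 0:
            agree += 1
    return agree / total if total else 1.0


# Hypothetical ratings of one user; the estimate is total, the test rating is partial.
r_test = {"o1": 0.9, "o2": 0.4, "o3": 0.1}
r_est = {"o1": 0.8, "o2": 0.2, "o3": 0.3, "o4": 0.5}

print(rmse(r_test, r_est))             # error of the rating values
print(order_agreement(r_test, r_est))  # agreement of the induced orderings
```

The two measures can disagree: an estimate may approximate rating values well and still order objects differently from the test data, which is why both views are reported.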

In [1] we described our team's approach to the RuleML 2015 Rule-based Recommender
Systems for the Web of Data Challenge Track. The task was to estimate the top 5 movies
for each user separately in a semantically enriched MovieLens 1M dataset, measured
by F-measure. We presented three methods. Surprisingly, the best recommendation came from
a domain-specific method like "recommend for all users the same set of movies from
Spielberg". Our main contributions were domain-independent data mining methods tailored
for top-k which combine second-order logic data aggregations and transformations of
metadata.

In [2] we introduced monotone preference models, i.e. models where $r^u_e$ is a monotone
composition of rankings on domains of explanatory attributes (possibly describing
user behavior, item content but also data aggregations). The target preference ordering
of users on items is given by a preference indicator (e.g. purchase, rating). There
we focused on learning the (partial) order of vectors of rankings of user-item attribute
values (without aggregation). We measure the degree of agreement of comparable vectors
with the ordering given by preference indicators for each user. We are interested in the
distribution of this degree across users. We provide sets of experiments on user behavior
data from an e-shop and on a subset of the semantically enriched MovieLens 1M data.

In [3] we went a step further. We assume explicit ratings with time-stamps
from each user. We integrate three different movie data sets, trying to avoid features
specific to a single data set and to be more generic. We use several metrics which had
not been used so far in the recommender systems domain. Besides classical rating
approximation with RMSE and the ratio of order agreement, we study new metrics for
predicting Next-k and (at least) 1-hit at Next-k. With these Next-k and 1-hit metrics we
try to model the display of our recommendation: we can display k objects and hope to
achieve at least one hit. We also trace the performance of our methods and metrics as a
distribution over individual users. We define transparent and complicated users with
respect to the number of methods which achieved at least one hit. We provide results of
experiments with several combinations of methods, data sets and metrics.

In [4] we wanted to test new methods and metrics. For this we designed a simulation.
For instance, the first hit is the step in which a top-k item (from the test set) appears
in our estimation (the smaller the number the better). The ideal we would like to reach
is to have, for all users and top-10, the first hit in the estimated top-10.
Unfortunately, we are far from this. We depict results for the parallel 1st hit measure,
i.e. we consider the test set (gold standard) ordering $<$ of items and the estimated
ordering $<_e$; the parallel 1st hit is then the minimal position $k$ for which
$\text{top-}k(<) \cap \text{top-}k(<_e) \neq \emptyset$.
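The parallel 1st hit defined above can be sketched in a few lines. The sketch assumes that both orderings are given as lists of item identifiers sorted from the most preferred item; the function name and the sample lists are hypothetical and serve only to illustrate the definition.

```python
def parallel_first_hit(test_order, est_order):
    """Smallest k for which the top-k prefixes of both orderings share an item.

    Returns None when the two orderings have no item in common.
    """
    seen_test, seen_est = set(), set()
    for k in range(1, max(len(test_order), len(est_order)) + 1):
        if k <= len(test_order):
            seen_test.add(test_order[k - 1])
        if k <= len(est_order):
            seen_est.add(est_order[k - 1])
        if seen_test & seen_est:
            return k
    return None


# Hypothetical gold-standard ordering and estimated ordering of the same four items.
test_order = ["a", "b", "c", "d"]
est_order = ["c", "d", "a", "b"]
print(parallel_first_hit(test_order, est_order))  # 3
```

In this example the top-2 prefixes {a, b} and {c, d} are disjoint, while the top-3 prefixes share the items a and c, so the parallel 1st hit is 3.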

The second distinguishing feature is that we measure the quality of our prediction for
each user separately. Results (of our method from [4]) are depicted in Figure 1 as box
plots with the 5% and 95% percentiles (the visualization is cut at step 100).

We show results for generated data: two sorts of users (with either triangular-shaped
or bell-shaped preferences) and several types of data density (the probability of an
explicit rating of an item in the train set and in the test set). First, we can see that
in general triangular-shaped users are easier to recommend to than bell-shaped users. For
example, for triangular users (in contrast with bell-shaped ones) all medians are below
50, that is, more than 50% of users have their 1st parallel hit earlier than at step 50
(if a web page shows 10 results, then there is a hit no later than on the 5th page); only
the two groups with a lower probability of rating are cut by step 100 (more than 25% of
users have a hit after step 100). Second, we can see that users with a higher probability
of rating (who rated 1 to 5 percent of items) are easier to recommend to than those with
a lower probability (who rated 1 to 5 per mille of items, i.e. 10 times fewer).

Nevertheless, we also see that comparing box plots yields only a partial order: one
group may have a better first quartile but a worse median. It is probably a task for the
business to say which is more relevant. We do not deal with implicit user behavior here.
In general, our strategy is to interpret a user's behavior as a (fictitious) explicit
rating, see also [6].

In [5] we considered user habits, i.e. how many users visit the second and third page
of recommendations (we assume a page displays 10 items).

Consider the data from Table 1. A 1st hit below 10 means a hit on the first visited page,
a 1st hit below 20 (above 10) means a hit on the second page, and so on. Let $p_i$ be the
% of users with 1st hit $< i \cdot 10$ and $q_i$ the % of users visiting the $i$-th page.
We calculate the aggregated success measure as $p_1 q_1 + p_2 q_2 + p_3 q_3$; results are
depicted in Figure 2.
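The aggregated success measure can be illustrated with a short sketch. The page-visit ratios below are the "data 2" column of Table 1 rewritten as fractions; the per-user first-hit positions are hypothetical, and the function name and the strict boundary "1st hit < i * 10" (taken literally from the text) are assumptions of this example.

```python
def aggregated_success(first_hits, page_visit_ratio, page_size=10):
    """Sum of p_i * q_i over pages.

    p_i: share of users whose first hit is below i * page_size.
    q_i: share of users who visit the i-th page.
    A first hit of None means the user had no hit at all.
    """
    n = len(first_hits)
    total = 0.0
    for i, q_i in enumerate(page_visit_ratio, start=1):
        p_i = sum(1 for h in first_hits if h is not None and h < i * page_size) / n
        total += p_i * q_i
    return total


first_hits = [3, 12, 25, 40, None]   # hypothetical first-hit positions of five users
q = [1.0, 0.868, 0.144]              # "data 2" column of Table 1 as fractions
print(aggregated_success(first_hits, q))
```

Here p_1 = 0.2, p_2 = 0.4 and p_3 = 0.6, so the measure is 0.2 * 1.0 + 0.4 * 0.868 + 0.6 * 0.144 = 0.6336. Working with fractions instead of percentages only rescales the result.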


Fig. 4. Parallel 1st-hit measure: typical result for different user types, see also [4].

In [4] we introduced a new pivot-based method. The pivot-based method is better than the
remaining ones when measured by # in top-50 (1st hit). Results show that our data mining
method is better than the pivot-based one when measured in RMSE.


Tab. 1. Various user habits in visiting pages of recommended objects (% of users).

            data 1   data 2   data 3   data 4   data 5
1st page    100      100      100      100      100
2nd page    9        86.8     92.04    86.72    51.4
3rd page    4.5      14.4     5.27     4.5      0


We also made several other comparisons. E.g. comparing RMSE for random users between
the model-based and pivot-based 3D methods shows that our data mining model again
gives better results than the pivot-based method.

We also measured the sizes of the intersection of the test and estimated orderings at
top-k. Here our data mining method was significantly less effective than the pivot-based
method.

We also report on results of PhD and Master's theses defended in our seminar, especially
on learning from implicit user behavior and/or online experiments.

A side product of our approach, the use of pivots for collaborative filtering, can
contribute to solving the cold start problem.

Future work is oriented in two directions. The first is to improve the results of this
introductory investigation by further, more complex experiments both on artificial data
(more than 3D) and on real data. Second, we find it interesting to study pivot-based
indexes on the space of observed ratings with respect to different metrics (distances,
measures), especially order-sensitive ones.


Fig. 5. Aggregated success for data density groups (x-axis: data density, i.e. the
non-missing data ratio). Results from [4] recalculated with habits from Table 1;
coefficients and colors correspond to columns in the table.
Abstract. Software system development is a complex process involving multiple
developers working on different tasks. Moreover, each developer has their own
habits and ways of solving tasks, which directly affect the properties of the developed
source code. Current software development support tools, however, monitor developers'
activities only minimally and thus lose significant data that would make it possible
to better determine the properties of the developed software and to improve the
traceability of software development itself. A partial solution to this problem was
brought by the infrastructure developed within the PerConIK project. This infrastructure,
however, was tailored to the research project and could not be distributed to the open
community of researchers for the purpose of collecting and sharing data. In this paper
we present the DevACTs system, which is based on the infrastructure of the PerConIK
project. The system solves the problems of the PerConIK infrastructure and is, in
addition, extensible with further sources of data about developers' activity with
source code, such as tracking developers' gaze or their physiology.

Contribution type: Application paper

Keywords: software development traceability, developers' activities, source code

1 Motivation

The analysis of software projects is one of the basic elements of software project
management; its goal is to identify why various negative events occurred during a
project [2]. Based on this analysis, project managers can adjust existing software
development processes, or define new ones, to prevent the identified negative events
from occurring. The analysis uses mainly software metrics computed from stabilized
versions of the source code, test results, and possibly records from task management
systems. These data provide information about the events that occurred, but only minimal
information about the process by which they occurred. The missing information about this
process can be supplied by empirical software metrics, which are evaluated from
developers' activities while they work on software artifacts.

Although empirical software metrics are able to answer "Why?" questions (Why are there
so many bugs in the source code? Why was the schedule exceeded?), they are used only
minimally. This is caused mainly by three problems [3]:

• Collecting empirical data is expensive in time and resources - much of the data is
collected manually by developers;

• Quality of the collected empirical data - manually collected data contain many errors
and are fairly sparse;

• Usability of empirical data - only few empirical software metrics are defined, and few
tools allow them to be interpreted and analyzed.

The problem of the usability of empirical data requires defining new empirical software
metrics. Attempts to design empirical software metrics [4, 5] were also made in the
PerConIK project¹ [1]. While designing these metrics, however, the problems of collecting
empirical data and of their quality became apparent. In the PerConIK project, the
proposed architecture made it possible to collect a fairly large amount of data about
developers' activities and source code. These data, however, come from only about
5 projects and from a limited sample of developers: 15 programmers in a software company
and 20 students. Moreover, the collected data are fairly sparse, since several errors
occurred during collection and led to the loss of part of the collected information. As a
consequence, the initial hypotheses behind the proposed metrics often could be neither
confirmed nor refuted.

2 The DevACTs project

The DevACTs project (Developer’s Activity, Code and Tasks) is a direct continuation of
the PerConIK project; its goal is to create a shared infrastructure for collecting and
analyzing empirical data about software development. The DevACTs infrastructure thus
builds on the PerConIK infrastructure while trying to solve the problems that prevent its
deployment in other institutions and complicate data collection from developers:

• Decentralized user management - the PerConIK infrastructure does not use central user
management. Each subsystem has its own user authorization, and several systems are
tightly bound to a specific deployment of the LDAP protocol. This decentralization of
access to the systems prevents deployment at other institutions and prevents unambiguous
matching of users across the collected data sets;

• Anonymous collection of developers' activities - the environment in which the
infrastructure was deployed allowed only fully anonymous collection of activities from
developers. Mapping them to changes in the source code therefore had to rely on
secondary properties, which led to errors in the data set;

• Complexity of deployment - the PerConIK web services require complex configuration
through XML configuration files; individual configurations were duplicated and
influenced one another. An incorrect setting of a single value thus often disabled the
whole infrastructure. Moreover, deploying the monitoring tools on a developer's machine
consists of separate installations of several tools, and there is no support for
automatic updates.

¹ http://perconik.fiit.stuba.sk/

Based on these identified problems, we refactored the existing implementation and removed
known bugs. We then implemented a central administration system, which integrates the
other subsystems and fulfills three main roles:

• Centralized configuration - each subsystem is configured through a web interface,
which validates the entered values and removes redundancies in the configurations;

• Infrastructure diagnostics - the central system collects event records from the
individual subsystems, which allows administrators to identify problems easily and early;

• User management - the central system is responsible for user authentication and
authorization. The other subsystems are thus freed from the need to store sensitive data
and to handle basic access rights. Administrators can also easily set user rights from
one place, which makes it possible to prevent unauthorized access to the data as well as
the pollution of collected data by unauthorized persons.

Besides developing the central system, which allows the infrastructure to be deployed
easily at new institutions, we also concentrated on improving support for collecting data
from developers. For this purpose we reworked the user interface of the client
application and implemented an update system and an installer that simply guides
developers through the installation of the necessary tools.

3 Conclusions and future work

The changes in the DevACTs infrastructure give us the possibility to collect cleaner data
about developers' activities and to deploy this collection not only in other
institutions, but also to deploy the monitoring tools with independent individuals or
teams who want to take part in collecting data about software development. Thanks to
these data we will be able to concentrate fully on research into new empirical software
metrics that will enable a better understanding of problems in software projects.

In the future, we plan to focus the DevACTs project on support for developers and project
managers, through which we will be able to provide the outputs of empirical software
metrics. This support is currently represented in the project by the CodeReview and CORD
systems. CodeReview gives developers a space for reviewing changes in source code,
whereas CORD focuses on supporting management through visualization of source code
artifacts and their properties. Although both systems work with source code, they were
developed by different teams and use different principles for providing the same or
similar information. We plan to merge these systems into one and thus minimize the burden
placed on developers and project managers by working with two different systems.

Acknowledgement: This publication was created thanks to partial support from the projects
VG 1/0752/14 and VG 1/0646/15 and from the project within OP Research and Development:
Research of methods for acquisition, analysis and personalized provision of information
and knowledge, ITMS: 26240220039, co-funded by the European Regional Development Fund.
Annotation:

DevACTs: Collecting and Evaluating Developers’ Activities, Tasks and Source Code 

Software system development is a complex process in which many developers are involved.
These developers work on different tasks and have different habits and ways of solving
them, which directly influences the characteristics of the developed source code. Current
software development tools support monitoring of developers’ activity only minimally on
their own, so important data needed for detailed analysis and evaluation of software
projects are missing. The infrastructure of the research project PerConIK partially solved
this problem. However, that infrastructure was designed only for a specific research
environment and is not deployable for multiple teams of developers. In this paper we
present the infrastructure of the project DevACTs, inspired by the project PerConIK. The
DevACTs infrastructure solves the problems of the original environment and is extendable
with new monitored events, e.g., gaze tracking or ECG.
Abstract. Information behavior of researchers in contexts of open science and
digital scholarship is considered for understanding information needs and changing
information infrastructures for scholarly communication. Our main research
questions are: Which components build the conceptual framework for modeling the
information environment of digital scholarship? Which differences in information
behavior of researchers of different disciplines can we identify? An analysis
of models of digital scholarship is presented as the context of the research.
A qualitative study of information behavior of 19 selected researchers is outlined
based on semi-structured interviews. First results of content analyses are presented,
including common general methodological approaches and different information
interactions and publishing. In conclusion an ecological model of research
information interactions is explained, composed of expertise factors,
methodological factors, and open science factors.
1 Introduction

Digital science is related to the transformation of creative scholarly communication and
information processes into digital environments. Digital technological developments
and the digital data deluge have changed information behavior and information interactions.
New types of documents and genres have emerged in digital environments, ranging
from blogospheres to mobile digital libraries. Open science refers to research processes
that are based on transparent information practices regarding methods, data, results and
democratic access to knowledge, and that allow broader public access to research results.
Open science includes open access to scholarly literature, open data, open institutional
repositories and electronic journals.
In our research we would like to understand the information needs of scholars and the
changing information infrastructure of scholarship. Our main research questions are:
Which components build a new conceptual framework for modeling the information environment
of digital scholarship? Which differences in information behavior of researchers
from different disciplines can we identify? Which patterns of digital information
use and publishing are typical for them? How should we design new services, tools and
systems for researchers as part of the knowledge infrastructure of digital scholarship?


2 Models of open science and digital scholarship 

We analyzed several models of digital science and social networks of research workers.
Hurd [1] outlines the most important changes in the information process based on digital
libraries (Fig. 1). Whitworth and Friedman [2] presented rich information interactions
between authors, editors, web publishers, reviewers and readers that have changed
the traditional information environment.
Fig. 1. Scientific communication: a model for 2020 (Hurd, 2000, p. 1281)


Björk [3] designed an updated version of information flows in scholarly communications.
Borgman [4] (Fig. 2) presented a scientific life cycle perspective on information
flows based on the analysis of big data, its management and infrastructure. New topics
open doors to exploration of human data interactions.
Fig. 2. Scientific life cycle example from the Center for Embedded Networked Sensing
(Borgman, 2015, p. 265)

Chowdhury [5] presents a model of sustainable digital services based on a sustainable
information environment (Fig. 3).



Fig. 3. Research issues and challenges in sustainable digital information services
(Chowdhury, 2014, p. 195)
As a result of analyses of the models we can identify three basic components in digital
and open scholarship: users and producers; knowledge infrastructure; and content,
including artifacts and value-added services. They provide a common contextual background
for conceptual modeling of information behavior of researchers ([6], [7], [8], [9]).

3 Information behavior of researchers: a qualitative study 

In the framework of a research project on digital scholarship we carried out a qualitative 
study into the information behavior of 19 selected researchers in Slovakia. The main 
research question was focused on determination of domain differences with regard to 
information behavior of researchers and their perceptions of open science. We applied 
the methodology of semi-structured interviews. 

A conceptual map was developed as a methodological tool for semi-structured interviews,
content analyses and further conceptual modeling (Fig. 4).



Fig. 4. Methodological design of the study (conceptual map) 

The participants of the study were 19 selected researchers in sciences and medicine,
humanities, social sciences and computer science in Slovakia. The selection criteria
were based on expertise and excellence in the domain, international networks, use of
big data, advanced technologies and unique characteristics of the disciplines. The 19
respondents included 13 males (68.4 percent) and 6 females (31.6 percent); the average
age was 54.4 and the average length of professional experience was 30 years. The
disciplines represented were humanities (8, 39 percent), sciences and medicine
(5, 28 percent), social sciences (4, 22 percent) and technical sciences (2, 11 percent).
The average duration of an interview was 72 minutes. The interviews were carried out
from October to December 2015 and from January to May 2016. The data were coded and the
frequencies of derived categories were interpreted. Deeper semantic analyses are under
way, using concept modelling and multiple analyses by different researchers in order to
ensure the validity of results.

3.1 Results of first analyses 

Relationships of scholarship with the broader public, transparency of research processes
and open access to data and publications were analysed. If we are to understand the
information practices of open science in social, technological and community dimensions,
we need to re-conceptualize the concept of research information interactions.
Research information interactions can be defined as complex relationships between
researchers and the information environment. Following the ACRL Framework [10] we
can define research information literacy as the ability to understand and use information
in order to carry out research in disciplines. However, little attention has been
paid to perceptions of open science and digital scholarship. That is why we analysed
the data in relation to factors of open and digital scholarship.

These analyses point to common patterns and disciplinary differences in perceptions
of knowledge infrastructure. Common patterns revealed shared critical analytical
information practices (information fluency). Practical experience and expertise are
manifested by reliance on authoritative information sources and personal international
expert networks. Open science factors were identified by researchers, especially
promotion of results and open access. They are also connected with international
participation, collaboration, peer networking, and information sharing (17 subjects).
Technological determination, special methods and software tools were found especially in
“big data” sciences, i.e. astrophysics, physics, genetics, archaeology, social sciences.
In humanities, a tendency towards building digital collections and digital libraries was
noted (e.g. the archival system PamMap, a Slavic languages atlas, archaeological
photographic digital collections). Further open science factors included policies,
evaluation of results, access to data and publishing. Awareness of researchers' social
networks was noted, including alternative metrics (altmetrics). The main differences
emerged from domain-specific research objects, research statements, methodologies,
procedures and data management. These differences are reflected in publishing activities
(humanities: monographs, sciences: journals), communication, information use and the
culture of disciplines. Methodological modes of social sciences, humanities, sciences
and technical sciences were identified.

4 An ecological framework of research information interactions 

Based on the results of the analyses we developed an ecological framework of encapsulated
research information interactions, composed of methodological factors, open science
factors and expertise factors (Fig. 5). Factors of open science (OS) include promotion,
open access and participation. Several gaps with regard to open science were identified,
namely in the awareness of open access (OA) potential and in the promotion of research.
The diagram represents intersections of processes which are relatively independent and
which, in mutual interaction, create the holistic ecology of information interactions.
These factors were derived from the content analysis of semi-structured interviews. The
main insight is that methodological and OS factors are common to disciplines, while
differences are based on expertise.





Fig. 5. The ecological framework of research information interactions 


5 Conclusion 

Models of digital scholarship and open science confirmed the need for deeper research into
the information needs of researchers. Based on this we developed a conceptual map which
was used as a methodological tool for the qualitative study of information behavior of
19 researchers in Slovakia. Many differences among disciplines were found (e.g. the
retrospective nature and broad context of humanities, the prospective nature and narrow
context of sciences, specific methodologies, types of data and practices). Following the
first analyses the ecological framework of encapsulated research information interactions
was presented. We identified three groups of factors in information behavior of
researchers, i.e. the expertise factors, methodological factors, and open science factors.
Based on this we can define research information literacy as understanding, sense
making and knowledge discovery integrated with motivation and research interests.
Our framework can be useful for development of knowledge infrastructures, including
systems and services which actively support researchers in information practices,
communication and collaboration. Perceptions of open science can help reconstruct
efficient partnerships between researchers, information professionals, librarians,
research managers, institutions and research agencies.

Research information interactions can lead to changes in the workflow of the research
and information processes and to new models of digital environments for researchers.
Support of information activities and creativity is needed in online genres and research
communities of practice. Several components of the digital environment (data, systems,
tools, services) can contribute to new models of research and information processes.
Further practical implications can be derived for value-added services and digital
tools for researchers.
Abstract. Personalized recommendations are currently very popular and increasingly
used. However, one of the basic problems in this context is users' distrust of
recommender systems. Users regard them as an intrusion into their privacy because
the systems use rather personal information. It is therefore important that
recommendations be transparent and understandable to users. To address these
problems we decided to focus on the presentation of recommendation results.
Specifically, we dealt with explaining individual recommended items to the end
user. In this context we created the recommender system ExplORe, which uses
collaborative filtering as a standard recommendation technique. Within this system
we designed and implemented a hybrid method of personalized explanation. The method
is independent of the recommendation technique and combines three basic approaches
to explanation in order to provide the user with a suitable type of personalized
explanation. The individual approaches are explanation based on similar users,
explanation based on content, and explanation based on knowledge about the user.

Paper type: Research paper

Keywords: personalized recommendation, explanation of recommendations, news domain


1 Introduction

A serious problem that significantly hinders wider adoption and use of recommendations
is often users' distrust. This distrust stems partly from the fact that recommendations
use rather personal information about users. This information may concern their
knowledge, their behavior on the site or certain social characteristics (friends,
communities, etc.). These systems thus produce recommendations without any closer
explanation or interpretation of why a given recommendation is suitable for the user.

An interesting approach to this problem is to vary the form in which recommendations
are presented. Such presentation may relate, for example, to:

• The placement and visualization of recommendations so that they attract the user and
are visible and usable.

• Explaining the recommendations themselves and the process of generating them, which
tries to bring recommendations closer to the user and thereby address the problem of
distrust.

Explanations are aimed directly at users and try to describe to them the reasons why
the given recommendations could be helpful to them [3]. Such explanations, however, do
not solve, and are not meant to solve, problems with incorrect recommendations. Their
goal is only to deliver a recommendation in a form that is closer to users.

Because of persisting problems with the acceptance of recommendations by users
themselves, our aim is to focus on better conveying the individual recommended objects
to the end user. We want recommendations not to appear off-putting but, on the contrary,
to be received as help or support. We adapted to this not only the explanation method
but also the presentation and visualization of the recommendations themselves.

In trying to engage the user, it is therefore important on the one hand to have a good
recommendation algorithm, but it is equally important to present these recommendations
in the best possible way [1]. Some studies even show that the presentation itself is in
certain cases more important than the recommendation technique [4]. In both cases,
however, the aim is to create a system that is useful from the user's point of view. The
basic points for achieving a useful recommender system are [2]:

• The need to evoke trust in users

• Transparency of the system towards the user

• Supplementary information about recommendations (images, ratings, etc.)

2 Hybrid explanation method

The main idea of the proposed method of personalized explanation is to generate
explanations with regard to users' preferences. Each user thus has explanations
available that are adapted to engage him or her as much as possible.

The proposed method is independent of the technique by which the given article was
recommended. On the other hand, we draw on recommendation approaches when generating
explanations. This means that the individual explanations are treated as if they were
recommended items. In this way, based on information about the recommended article and
information about the user, we generate, or recommend, an explanation that will be
suitable and interesting.

The resulting method is a kind of hybrid personalized explanation. It is hybrid because
it combines several approaches so as to achieve an optimal result. These approaches are:

• Explanation based on similar users

• Explanation based on article content

• Explanation based on knowledge about the user

It is personalized because, within the given approach, each user is shown the specific
piece of information from which the explanation was derived:

• Explanation based on similar users - in this case the specific user used for the
explanation is shown.

• Explanation based on article content - in this case the specific article used for the
explanation is shown.

• Explanation based on knowledge about the user - in this case the specific piece of
knowledge used for the explanation is shown.

In the context of hybrid explanation, these approaches need to be combined. In this way
we provide the type of explanation that suits the given user. In this method we first,
by monitoring the user's preferences, find the type of explanation that is suitable in
the context of the properties of the given article and that is at the same time suitable
for the given user. Subsequently it is determined which type of explanation suits the
specific user.
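
The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ExplORe implementation: the names `UserProfile`, `choose_explanation` and the `evidence` dictionary are hypothetical, and the smoothed click rate is just one possible way to turn monitored behavior into a per-user preference for an explanation type. An explanation type counts as applicable to an article only if a concrete backing item (a similar user, a similar article, or a piece of knowledge about the user) is available.

```python
from dataclasses import dataclass, field

# The three explanation approaches named in the text.
EXPLANATION_TYPES = ("similar_users", "content", "user_knowledge")


@dataclass
class UserProfile:
    # Hypothetical counters: how often each explanation type was shown to the
    # user and how often the explained article was then clicked.
    shown: dict = field(default_factory=lambda: {t: 0 for t in EXPLANATION_TYPES})
    clicked: dict = field(default_factory=lambda: {t: 0 for t in EXPLANATION_TYPES})

    def record(self, explanation_type, was_clicked):
        self.shown[explanation_type] += 1
        if was_clicked:
            self.clicked[explanation_type] += 1

    def preference(self, explanation_type):
        # Smoothed click rate (assumption): unseen types start at 0.5.
        return (self.clicked[explanation_type] + 1) / (self.shown[explanation_type] + 2)


def choose_explanation(profile, evidence):
    """Pick the explanation type the user prefers among those the article supports.

    `evidence` maps an explanation type to the concrete item backing it
    (similar user, similar article, known fact about the user) or to None.
    """
    applicable = [t for t in EXPLANATION_TYPES if evidence.get(t) is not None]
    if not applicable:
        return None
    best = max(applicable, key=profile.preference)
    return best, evidence[best]
```

The returned pair carries both the chosen type and the concrete item, which matches the requirement that the user be shown the specific user, article or piece of knowledge behind the explanation.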

3 Evaluation and conclusion

We implemented the proposed explanation method as part of the ExplORe system. A
long-term experiment was carried out in this system, simulating the conditions of a
real news medium with newspaper articles. During the experiment we collected data on
users' activity in the system. The experiment lasted 18 days. A total of 17 people took
part, 13 of whom were university students. As many as 15 participants were aged 20-30.
The experiment was divided into two parts, in which one part of the users did not have
explanations available and the other group was offered explanations.

In both groups of users, explanations had a positive effect on their level of activity
in the system and therefore, logically, also on the precision of the recommendations,
which were generated for them by the same method. Fig. 1 clearly shows that in both
cases there is an increase in the number of clicks on articles with explanations in
terms of precision.



Fig. 1. Click precision across groups for articles without and with explanations.
The hybrid method of personalized explanation achieved fairly positive results in
several areas of our research. We therefore also see fairly large potential for further
research in this area. In this work we tried to adapt the explanation to a specific
user. However, it is equally interesting to try to adapt the explanation to a specific
news article as well. Explanation for articles is related to finding a way to explain
a certain type of article. Here it may turn out that a particular type of explanation
is suitable for certain areas or topics.

Acknowledgement: This publication was created thanks to partial support from the projects...
Annotation:

Presentation of personalized recommendations 

Nowadays, personalized recommendations are widely used and very popular. However, one of
the basic problems is users' distrust of recommendation systems. Users consider them an
intrusion into their privacy. Therefore, it is important to make recommendations
transparent and understandable to users. To solve these problems, we decided to focus on
the presentation of recommendation results. Specifically, we focused on explaining each
recommended item to the end user. In this context, we have created the recommendation
system ExplORe, which uses collaborative filtering as a standard recommendation technique.
Within this system, we have designed and implemented our hybrid method of personalized
explanation of recommendations. This method is independent of the recommendation technique
and combines three basic approaches to explanation in order to provide an appropriate type
of personalized explanation to the end user. The three basic approaches are explanation
based on similar users, explanation based on content, and explanation based on knowledge
about the user.
Abstract. OpenPonk (https://openponk.github.io), originally known as DynaCASE,
is an emerging open-source software platform for conceptual modelling. Its goal
is to support intuitive creation, editing, verification and use of models. It also
supports domain-specific languages (DSL). OpenPonk currently enables work, at
various levels of maturity, with finite-state machines, Petri nets, BORM ORD,
DEMO, OntoUML and UML class diagrams. The motivation is to offer an open
and easily extensible platform for implementing modelling notations and
algorithms. The target community consists of teachers, researchers and practitioners.

Compared with other current solutions, which are usually based on
Java/Eclipse/EMF/GMF, our solution uses the purely object-oriented
technology Pharo/Roassal, which is considerably easier to learn simply by
watching, trying and then imitating. Moreover, it is a live system in which the
user can interact with models directly during development.

We also describe a project based on OpenPonk for the French institution
CIRAD.


Paper type: Application paper, Paper on ongoing research
Keywords: CASE, CABE, Pharo, Smalltalk, Conceptual modelling


1 Introduction - ideas and goals of OpenPonk

OpenPonk is a platform for conceptual modelling that is being built in the live
environment Pharo[2].

For our research on ontologies and conceptual modelling we need a tool that not only
supports work with current notations but also offers a user-friendly way of implementing
new notations and subsequently studying and using them. In addition, we are interested
in supporting notations and models tailored to a corporate environment. At present,
support at various levels of maturity is implemented for finite-state machines, Petri
nets, BORM ORD[11], DEMO, OntoUML and UML class diagrams.
The aim of our tool is not to compete directly with current commercial tools such as
Enterprise Architect[14] or MetaEdit+[8]; we focus rather on use in science and
research, where platform independence, open code and easy extensibility, not only with
new models but also with further functions, are an advantage. The alternatives also
include tools based on the Eclipse platform[6], such as Modelio, Papyrus or OpenCABE.
OpenCABE is a previous tool developed within our research group. Because of the
limitations and complexity of the Eclipse platform, however, even after 6 years of
development it did not reach a state in which students could use the tool for their own
projects requiring the implementation of new models and functions. With OpenPonk this
has already succeeded in several cases.

OpenPonk is based on the idea of a simple extensible core, that is, basic classes and
support for conceptual modelling, which is further extensible by plugins providing new
models, notations and algorithms subsequently created by users (a user usually means a
user-developer who develops their own OpenPonk plugins for their work).

2 Architecture

OpenPonk serves primarily for developing tools that work on models with a graphical
representation, i.e. diagrams. To allow models to be edited through diagrams, the core
is built according to the model-view-controller (MVC) principle[12].

The model is the cornerstone of MVC and contains the domain model. In our case the model
is the meta-model of a particular diagram notation, for example the UML meta-model of
class diagrams according to the specification[9]. If the MVC model (i.e. the diagram
meta-model) is created directly with implementation in OpenPonk in mind, it suffices to
use the prepared classes and only add the specifics of the given model. An interesting
option is to use an already existing model that does not contain the basic functions
needed for use in OpenPonk. In that case the controller (see below) must take over
responsibility for these functions, for which the MetaLinks technology[4] can be used.
This allows full integration of the model without any need to modify it, which was
achieved with the FAMIX[5] model for UML class diagrams (see below).

The view renders the model by means of individual elements on the drawing canvas. Each
of them usually represents a specific element of the model. For this we use Roassal[1],
a graphics library for working with vector graphics. It allows easy creation of new
shapes, interaction with them and further modifications. For simpler notations, such as
finite-state machines, these extensions of Roassal suffice, but more complex notations
require implementing an additional layer on top of Roassal.

Visual entities are created and managed in the controller, where the user is required to
implement several methods with the help of methods and classes prepared in the OpenPonk
core and in the Roassal library. Usually the model itself and each individual model
element have their own controller. Controllers are responsible for interpreting the
user's signals coming from the view and for propagating changes from and to the model.
The controller thus takes care of connecting the view for the given model, but also of
controlling the form for editing data and the tool palette for controlling and creating
diagram elements. In OpenPonk the view has no direct access to the model, because their
interaction is handled entirely by the controller. This is particularly suitable for
using already existing models that were not designed for such integration.
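
The division of responsibilities described above can be illustrated with a small sketch. It is written in Python purely for illustration; OpenPonk itself is implemented in Pharo, and the class names `ExistingState`, `LabelView` and `StateController` are hypothetical, not part of the OpenPonk API. The point shown is that the view never holds a reference to the model: it only reports user input to the controller, which validates it, updates a model that knows nothing about diagrams, and refreshes the view.

```python
class ExistingState:
    """Stands in for a pre-existing model element that knows nothing about diagrams."""

    def __init__(self, name):
        self.name = name


class LabelView:
    """Visual element: holds only display data and a callback, never the model."""

    def __init__(self):
        self.text = ""
        self.on_rename = None  # installed by the controller

    def user_typed(self, new_text):
        # Forward the raw user signal; the view does not decide what it means.
        if self.on_rename is not None:
            self.on_rename(new_text)


class StateController:
    """Mediates all interaction between one model element and its view."""

    def __init__(self, model, view):
        self._model = model
        self._view = view
        view.on_rename = self._handle_rename
        self.refresh()

    def refresh(self):
        # Propagate the model state into the view.
        self._view.text = self._model.name

    def _handle_rename(self, new_text):
        # Interpret and validate the user's signal before touching the model.
        new_text = new_text.strip()
        if not new_text:
            return  # reject empty names, model and view stay unchanged
        self._model.name = new_text
        self.refresh()
```

Because `ExistingState` needs no changes to take part in this arrangement, the same pattern accommodates models that were not designed for integration with a diagram editor.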

Controllers are also responsible for continuously validating the connection of elements
by edges and the nesting of elements in others already while they are being created. For
the basic functions it suffices to implement a few methods required by the classes of
the OpenPonk core. For complex validation the user can develop a suitable extension,
such as the OntoUML validation editor[18].

The user interface is implemented in the Spec framework[13]. The basic Spec window
contains the application's controls and the Spec sub-window Editor, which comprises the
drawing canvas, the tool palette for working with the diagram, the form for editing data
and other sub-windows bound to the diagram. Each Spec sub-window provides an API for
modifying it.

3 Extension possibilities and uses

Each extension (plugin) is a subclass of the prepared class Plugin. To start working
with a model, controllers and visual elements it suffices to implement basic
functionality; after that, various extensions of OpenPonk functionality for the given
model type can be realized. Several examples follow.

3.1 Editing models with scripts

Besides the classic approach to creating model instances (models of specific diagrams)
with visual tools, current tools also offer creation and editing by scripts in languages
developed specifically for this single purpose. An example is the Epsilon Object
Language[7] on the Eclipse platform. The developer, however, has to create and maintain
this scripting language, and the user has to learn it and cope with its limits.
OpenPonk, by contrast, is developed in the live Pharo environment, which allows
manipulation directly in the Pharo Smalltalk programming language that the user has at
hand.

3.2 Visual simulation

When an extension of functionality is anticipated, a corresponding API usually has to be
created, which however has limited possibilities and requires additional maintenance. By
contrast, the implemented finite-state machine simulator, for example, uses direct
access to the model and the view, which do not need to know about this extension. The
simulator obtains information from the model and modifies the visual layer (the view)
accordingly.
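
A minimal sketch of such a simulator follows, again in Python for illustration only. The classes `StateMachine` and `DiagramView` and the function `simulate` are hypothetical stand-ins, not OpenPonk classes, and the sketch assumes a deterministic automaton. Neither the model nor the view contains any simulation-specific code: the simulator reads the transitions from the model and marks the current state in the view.

```python
class StateMachine:
    """Plain model of a deterministic finite-state machine."""

    def __init__(self, initial, transitions, accepting):
        self.initial = initial
        self.transitions = transitions  # maps (state, symbol) to the next state
        self.accepting = accepting      # set of accepting state names


class DiagramView:
    """Plain view: one highlight flag per drawn state."""

    def __init__(self, state_names):
        self.highlighted = {name: False for name in state_names}


def simulate(machine, view, word):
    """Run `word` through `machine` and highlight the final state in `view`.

    Returns the list of visited states and whether the word was accepted.
    """
    current = machine.initial
    trace = [current]
    consumed_all = True
    for symbol in word:
        next_state = machine.transitions.get((current, symbol))
        if next_state is None:
            consumed_all = False  # no transition defined: the run gets stuck
            break
        current = next_state
        trace.append(current)
    # Update the visual layer based on information read from the model.
    for name in view.highlighted:
        view.highlighted[name] = (name == current)
    return trace, consumed_all and current in machine.accepting
```

For instance, a two-state machine that accepts words with an even number of `a` symbols can be simulated by constructing `StateMachine("even", {("even", "a"): "odd", ("odd", "a"): "even"}, {"even"})` and passing it to `simulate` together with a `DiagramView(["even", "odd"])`.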


3.3 UML round-trip engineering for the ABM platform Cormas

In cooperation with the research group Cirad RU Green we developed a UML class diagram
editor for the ABM platform Cormas[3][17], with support for round-trip engineering, i.e.
the ability to create code from a model and vice versa. The aim is not a fully automatic
conversion but assistance with creating the structure of the code or diagram. This is
where the connection with the FAMIX model is used, without the model having to be
adapted for use with OpenPonk.

Support for re-creating a model from generated code requires some information that need
not be contained in the code, which applies all the more to dynamically typed languages.
For the code to contain this information, the code generator must add additional
information for this purpose. The layout of the diagram elements is then created
automatically[16].
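
The idea of carrying extra information through generated code can be sketched as follows. The text does not say how the Cormas editor encodes this information, so everything here is an assumption made for illustration: the generator emits Python rather than Smalltalk, and the `# @type` comment tag, `generate_class` and `recover_model` are hypothetical. The sketch shows why the annotation is needed: the assignment `self.age = None` alone says nothing about the attribute's type in a dynamically typed language.

```python
import re


def generate_class(name, attributes):
    """Generate class source from a model; `attributes` maps attribute name to type name."""
    lines = [f"class {name}:", "    def __init__(self):"]
    if not attributes:
        lines.append("        pass")
    for attr, type_name in attributes.items():
        # The trailing comment carries the type that the code itself cannot express.
        lines.append(f"        self.{attr} = None  # @type {type_name}")
    return "\n".join(lines) + "\n"


_CLASS_RE = re.compile(r"^class (\w+):", re.MULTILINE)
_ATTR_RE = re.compile(r"^\s+self\.(\w+) = None  # @type (\w+)$", re.MULTILINE)


def recover_model(source):
    """Rebuild (class name, attribute types) from source produced by generate_class."""
    name = _CLASS_RE.search(source).group(1)
    attributes = dict(_ATTR_RE.findall(source))
    return name, attributes
```

Calling `recover_model(generate_class("Farmer", {"age": "Integer", "plot": "Plot"}))` returns the original class name and attribute types, which closes the round trip for this simplified case.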

4 Conclusion

Since our research group works with various forms and uses of conceptual modelling,
OpenPonk is a very important tool for us.

OpenPonk is a successful continuation of our previous tool on the Eclipse platform,
OpenCABE[10], which served as a source of the best architectural principles. These were
used and adapted to produce a highly customizable and extensible tool with a simple
core. The name OpenPonk makes the "open" part evident. This is also achieved thanks to
the open, dynamic and live Pharo platform.

The tool is in many respects still at an early stage of development, yet it has already
been used successfully for other projects. The extensibility and simplicity of
OpenPonk's principles have also shown in student projects, in which students were able
to familiarize themselves with the platform and use it for their work consisting of
implementing a new notation or new algorithms. Our goal is to give the community a tool
for research, development and experimentation, which we come closer to with every new
user.

We currently cooperate with the Dutch company ForMetis Enterprise Engineers on
developing simulations and validations of industrial models, and we are in contact with
INRIA Lille Nord Europe and the University of Antwerp, which are interested in closer
cooperation.

Further information about OpenPonk can be found in a more extensive article in
English[15].

Acknowledgement: The development of OpenPonk is currently sponsored by the company
ForMetis Consultants. The development of round-trip engineering support (in particular
the UML class diagram editor for ABM CORMAS) was funded thanks to RU Green CIRAD. The
development of MetaLinks was sponsored by the company Synectique and by ESUG through
the Mobility Support programme.
Annotation:

OpenPonk — A Conceptual Modelling Platform for Education, Research and Practice 

OpenPonk, formerly known as DynaCASE, is an emerging open software platform for
conceptual modelling. Its goal is to support user-friendly diagram creation and
editing, as well as model validation, transformation and other algorithms. Working
with domain-specific languages is also supported. Currently, OpenPonk contains
support for finite state machines, Petri nets, BORM ORD, DEMO, OntoUML and UML
Class Diagrams in various stages of maturity. The vision is to offer an open,
easily extensible platform for implementing modelling notations and algorithms.
Researchers, teachers and practitioners are the target community.

Compared to other current solutions, which are usually based on
Java/Eclipse/EMF/GMF, our solution is implemented in a pure object-oriented
technology, Pharo/Roassal, which is considerably simpler to master and extend by
"watch and learn" and "copy-paste". Moreover, our solution is a live system, in
which the user may interact with the models during development.

We also present a project for the French institute CIRAD, which is based on
OpenPonk.

We recommend the English-language paper [15] for more information.
Abstract. Data acquisition, subsequent processing and analytics are gaining
importance in the industrial automation domain. We are witnessing a trend of
replacing traditional approaches with more capable Big Data methods and
paradigms. To exploit this change, we propose the so-called Semantic Big Data
Historian, which handles data of larger volume, variety and velocity. Our
historian software prototype benefits from an ontological data model and from the
Hadoop platform. In this paper, we briefly describe our implementation of a
prototype of such historian software. Next, we introduce possible models of the
storage layer of the proposed Semantic Big Data Historian with respect to the RDF
data format. The storage models exploit the distributed file system of the
corresponding Big Data framework for storing RDF data. Finally, the proposed and
implemented hybrid model is introduced together with possible extensions.

Contribution type: Research paper 

Key words: Ontology, RDF, Big Data, Hadoop 


1 Introduction 

The trend of exploiting Big Data paradigms and technologies is reaching the domain
of industrial automation, where they are used for sensor data acquisition,
processing and analytics. This approach to data management and processing offers
practical ways to overcome obstacles such as huge data volumes, requirements for
"real-time" data processing, and increasing data heterogeneity. Furthermore, the
need to process and analyze heterogeneous data sources is especially evident in
this domain, and therefore data have to be accompanied by a semantic description
that includes their mutual relations.

To address these needs, we have proposed and implemented the Semantic Big Data
Historian (SBDH) [1], which is able to handle both semantics and Big Data. The core
data we store are data from sensors. To describe both the sensors and the data we
use the SHS (Semantics for Historian Sensors) ontology, which is based on the SSN
(Semantic Sensor Network) ontology¹. We have extended the existing concepts and
relations to capture the data that we need to process.

One of the most important SBDH issues is the realization of the SBDH storage layer,
which is responsible for RDF data handling. As already indicated, the solution
exploits a Big Data framework for data processing, specifically Apache Hadoop².
In this paper we therefore discuss options for storing RDF data by means of the
Hadoop Distributed File System. The nature of the data (predominantly time series)
is taken into consideration.


2 HDFS Model 

Hadoop Distributed File System (HDFS) is a scalable and reliable data store
designed to run on commodity hardware [2]. HDFS is similar to other distributed
file systems, but differs in several key respects: it is highly fault-tolerant, is
designed for low-cost hardware, provides high-throughput access to application
data, and relaxes a few POSIX³ requirements to enable streaming access to file
system data. HDFS is also able to manage large files and is therefore suitable for
the storage layer of the Semantic Big Data Historian.

We identified three different approaches to using HDFS for storing RDF data,
distinguished by the way they handle the data model:

- Single file model: preserves the triple construct of classical RDF.

- Vertical partitioning model: splits RDF triples according to their property.

- Entity class-based model: utilizes a high-level entity class graph to create RDF
partitions [3]. First, similar entities (subjects) are grouped (according to a
similarity measure) into an entity class. The corresponding entity class graph is
then partitioned using METIS⁴. This model is not discussed in the following
sections because it is not used in SBDH.

2.1 Single File Model 

The single file model preserves the RDF triples in the form (subject, predicate,
object). In other words, data are stored within HDFS in one file. HDFS is then
responsible for splitting the file into blocks, replicating the blocks, etc.

An example of a system based on this approach and HDFS is PigSPARQL [3].
Furthermore, SHARD [4] uses a variation of the single file model in which triples
with the same subject are merged into one line of an HDFS file.


1 https://www.w3.org/2005/Incubator/ssn/ssnx/ssn

2 http://hadoop.apache.org 

3 POSIX - The Portable Operating System Interface 

4 http://glaros.dtc.umn.edu/gkhome/metis/metis/overview 
:CO2ds048 rdf:type :CO2ObsValue :hasQuantityValue "355.0"
:hasQuantityUnitOfMeasurement :parts-per-million
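The subject-merged variant of the single file model can be illustrated with a short sketch. It is not taken from SHARD or SBDH; the function name and the in-memory list of tuples are assumptions made for illustration, and the sample identifiers follow the sensor examples in this section.

```python
from collections import defaultdict


def merge_by_subject(triples):
    """Merge (subject, predicate, object) triples into one text line per subject."""
    grouped = defaultdict(list)
    for subject, predicate, obj in triples:
        grouped[subject].append(f"{predicate} {obj}")
    # Dictionaries keep insertion order, so subjects appear in first-seen order.
    return [" ".join([subject] + pairs) for subject, pairs in grouped.items()]


# Illustrative triples only.
triples = [
    (":CO2ds048", "rdf:type", ":CO2ObsValue"),
    (":CO2ds048", ":hasQuantityValue", '"355.0"'),
    (":THSds075", ":hasQuantityUnitOfMeasurement", ":percentage"),
    (":CO2ds048", ":hasQuantityUnitOfMeasurement", ":parts-per-million"),
]
lines = merge_by_subject(triples)
for line in lines:
    print(line)
```

In a real deployment the resulting lines would be written to a single HDFS file; the sketch stops at the list of lines to stay independent of any Hadoop client library.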


2.2 Vertical Partitioning Model 

The single file model introduced above is easy to implement but has some
disadvantages. The main obstacle is the I/O cost during query processing. The
vertical partitioning model is more suitable. In this model, triples are
partitioned with respect to their property and stored in files named after the
corresponding property. The vertical partitioning model is employed, for example,
in [5]. In the case of SBDH, the file hasQuantityUnitOfMeasurement contains the
following data:

:CO2ds048 :parts-per-million 
:THSds075 :percentage 
:THSds075 :degreeCelsius 
:PRSds032 :hectopascal 

This model overcomes the deficiencies of the single file model, but in some cases
data are not homogeneously distributed among the files (e.g., the type file is
usually very large). In the case of SBDH, the biggest file would be
hasQuantityValue.

Further file splitting can be performed to ensure a homogeneous data distribution
among files. HadoopRDF⁵ creates partitions according to both the data property and
the object. For example, the triple (:THSds075 :hasQuantityUnitOfMeasurement
:percentage) would be stored in a file named
hasQuantityUnitOfMeasurement#percentage.
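Both partitioning schemes can be sketched in a few lines. This is a minimal in-memory illustration, not the implementation of [5] or of HadoopRDF: the function names are hypothetical, dictionary keys stand in for file names, and storing only the subject in the property#object files is an assumption.

```python
from collections import defaultdict


def local_name(term):
    """Drop the leading ':' of a prefixed name such as ':percentage'."""
    return term.lstrip(":")


def partition_by_property(triples):
    """Vertical partitioning: one 'file' per property, holding subject-object pairs."""
    files = defaultdict(list)
    for subject, prop, obj in triples:
        files[local_name(prop)].append(f"{subject} {obj}")
    return dict(files)


def partition_by_property_and_object(triples):
    """Finer split: one 'file' per property#object, holding the subjects only."""
    files = defaultdict(list)
    for subject, prop, obj in triples:
        files[f"{local_name(prop)}#{local_name(obj)}"].append(subject)
    return dict(files)


triples = [
    (":CO2ds048", ":hasQuantityUnitOfMeasurement", ":parts-per-million"),
    (":THSds075", ":hasQuantityUnitOfMeasurement", ":percentage"),
    (":THSds075", ":hasQuantityUnitOfMeasurement", ":degreeCelsius"),
    (":PRSds032", ":hasQuantityUnitOfMeasurement", ":hectopascal"),
]
by_property = partition_by_property(triples)
by_property_and_object = partition_by_property_and_object(triples)
```

The first function yields a single large partition for a frequently used property, which is the skew described above; the second spreads the same triples over several smaller partitions.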


3 Hybrid SBDH Model 

The current realization of the SBDH storage architecture combines the single file
model with a vertical partitioning-like model. This hybrid model replaced the
previously used single file model, which was insufficient due to the high I/O
costs during query processing. The single file model is unsuitable for time-series
data storage, especially for queries with range filter expressions and order
constraints.

In detail, vertical partitioning is used for all sensor measurements, where the
partitions are created with respect to subject and property and each value is
accompanied by its timestamp. For example, the file CO2ds048#hasQuantityValue
contains the following data:

2012-04-29T00:00:10 355.0 
2012-04-29T00:00:40 355.1 
2012-04-29T00:01:10 355.0 


5 http://cs.utdallas.edu/semanticweb/Hadoop-RDF/hadoop-rdf.html 
Other triples are stored according to the single file model. The different
handling of sensor measurements reflects the fact that the volume of measurements
is significantly larger than that of the rest of the data.
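The routing rule of the hybrid model, and the reason it helps range queries, can be sketched as follows. This is an illustration under stated assumptions rather than the SBDH code: the record layout (a triple plus an optional timestamp), the set MEASUREMENT_PROPERTIES and both function names are hypothetical, and in-memory lists stand in for HDFS files.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Assumption: the properties that carry sensor measurements are listed by hand.
MEASUREMENT_PROPERTIES = {":hasQuantityValue"}


def store_hybrid(records):
    """Split (subject, property, object, timestamp) records into the two stores."""
    partitions = defaultdict(list)   # 'subject#property' -> [(timestamp, value)]
    single_file = []                 # all remaining triples
    for subject, prop, obj, timestamp in records:
        if prop in MEASUREMENT_PROPERTIES and timestamp is not None:
            name = f"{subject.lstrip(':')}#{prop.lstrip(':')}"
            partitions[name].append((timestamp, obj))
        else:
            single_file.append((subject, prop, obj))
    for rows in partitions.values():
        rows.sort()  # ISO 8601 timestamps sort chronologically as strings
    return dict(partitions), single_file


def range_query(rows, start, end):
    """Return the rows with start <= timestamp <= end from one sorted partition."""
    timestamps = [timestamp for timestamp, _ in rows]
    return rows[bisect_left(timestamps, start):bisect_right(timestamps, end)]


records = [
    (":CO2ds048", ":hasQuantityValue", "355.1", "2012-04-29T00:00:40"),
    (":CO2ds048", ":hasQuantityValue", "355.0", "2012-04-29T00:00:10"),
    (":CO2ds048", ":hasQuantityValue", "355.0", "2012-04-29T00:01:10"),
    (":CO2ds048", ":hasQuantityUnitOfMeasurement", ":parts-per-million", None),
]
partitions, single_file = store_hybrid(records)
selected = range_query(partitions["CO2ds048#hasQuantityValue"],
                       "2012-04-29T00:00:30", "2012-04-29T00:01:10")
```

Because each partition holds one sensor's values in time order, a range filter only touches one small sorted file instead of scanning every triple.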

4 Conclusion 

In this paper, we have tackled the problem of RDF data storage for the Semantic
Big Data Historian. We briefly described possible ways to store data by means of
HDFS. Then, we briefly introduced the hybrid model of the SBDH storage layer.

The combination of a semantic description of industrial data with the Hadoop
framework represents an important step towards a scalable, robust, distributed,
and fault-tolerant data processing and analytical solution. Such a solution is
essential for more efficient and more useful decision making.

Based on our preliminary measurements, we conclude that the hybrid model is
satisfactory for the SBDH storage layer. However, some deficiencies remain. As
future work, we plan to optimize time-series storage. Querying of samples could be
sped up by using a more advanced data structure in which data are grouped by a
particular time interval, e.g., measurements can be grouped in a hierarchical
structure by year, month, day, hour, etc.
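One way to picture the hierarchical grouping proposed as future work is a directory-like key per hour. The sketch below is only a guess at such a layout: the path scheme, the function names and the third sample value are made up for illustration and are not part of SBDH.

```python
from collections import defaultdict
from datetime import datetime


def bucket_path(partition, timestamp):
    """Build a year/month/day/hour path for one sample of a partition."""
    t = datetime.fromisoformat(timestamp)
    return f"{partition}/{t:%Y}/{t:%m}/{t:%d}/{t:%H}"


def group_samples(partition, samples):
    """Group (timestamp, value) samples by their hourly bucket."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[bucket_path(partition, timestamp)].append((timestamp, value))
    return dict(buckets)


samples = [
    ("2012-04-29T00:00:10", "355.0"),
    ("2012-04-29T00:00:40", "355.1"),
    ("2012-04-29T01:00:10", "355.3"),  # invented sample in the next hour
]
buckets = group_samples("CO2ds048#hasQuantityValue", samples)
```

With such keys, a query for a given day or hour can skip every bucket whose path prefix does not match, instead of reading the whole partition.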
Abstract. The paper addresses two aspects of creating linkable data in the area of
public-administration budgets and spending (fiscal data), which is part of the
agenda of the Horizon 2020 project OpenBudgets.eu. These are the design of a data
model, built on the RDF vocabulary Data Cube Vocabulary (DCV) and on code lists in
the SKOS format, and the conversion of data (typically in the CSV format) into the
target RDF format using transformation applications, in particular the ETL
framework LinkedPipes. The conversion also includes validation of the data by a
system of integrity constraints.


Contribution type: Application paper

Keywords: RDF, budget, Data Cube Vocabulary, linking


1 Introduction

Data on public-administration budgets and spending (fiscal data) are, from a
technical point of view, sets of observations described by the values of certain
characteristics (dimensions). Typically these are temporal, spatial (geographic),
administrative, thematic and similar dimensions. The data can be represented as
multidimensional "cubes" and subsequently analysed by means of data analytics
(interactive visualization, data mining, etc.).

The Horizon 2020 project OpenBudgets.eu (2015-2017) aims to support various
scenarios of fiscal data use by journalists, by anti-corruption non-governmental
organizations (e.g. Transparency International) and by local civic activists, the
last specifically in the context of so-called participatory budgeting. These three
areas were therefore chosen as model target tasks, with corresponding work
packages in the application part of the project. The individual work packages of
the technological part of the project then deal in turn with 1) the definition of
the data model, 2) extraction, preprocessing and linking of data, and automated
analytical tasks over them, 3) visualization of data and of analysis results, and
4) integration of the developed tools into a unified platform. In this paper we
address the first and partly the second group of activities, to which a team from
the Czech Republic contributes substantially (under the banner of VŠE Praha, but
with the actual involvement of experts from MFF UK and FIT ČVUT as well, i.e. it is
a joint effort of the academic partners of the OpenData.cz initiative).
The potential of fiscal data for analytical tasks increases significantly when
they are enriched with data that describe in more detail the objects to which the
individual dimensions of the data cube refer. Geographic localities
(municipalities, regions) can thus, for example, be included together with their
demographic, economic (e.g. GDP per capita) and political (e.g. dominant political
party) characteristics. Given the potential of such linking, technologies based on
RDF were chosen in the project for modelling and processing the data.

2 Data model for fiscal data

A key role in modelling the data is played by the RDF vocabulary Data Cube
Vocabulary (DCV) [1] and by a large number of public-administration code lists
converted to the SKOS format. The core OpenBudgets.eu data model comprises a total
of 20 components conforming to the DCV specification:

• 17 dimensions, defining in particular the fiscal period, the administrative,
economic or functional classification of an expenditure, the phase of budget
preparation/execution, the recipient of the payment, the associated project, or the
date on which the expenditure arose. Dimension values are usually selected from
the corresponding code lists.

• 2 attributes (the currency of the payment and a flag indicating whether the
payment includes tax)

• 1 measure (the monetary amount itself).

The standard procedure for using the data model typically involves the following
steps:

1. Analysis of the input data file (typically in the tabular CSV format or one
straightforwardly convertible to it): the meaning of its individual columns and of
the values used in them.

2. Determining which columns should be included in the target data format, and
which type of component (dimension, attribute, measure) each corresponds to in the
given context.

3. Direct mapping of those components of the OpenBudgets.eu model that
sufficiently match the meaning of the columns.

4. Creation of new components, preferably as specializations (subproperties) of
existing components, and their mapping to columns.

5. Assembly of the so-called data structure definition (DSD) from the mapped
components.

6. Conversion of the data to RDF, in the structure corresponding to the created
DSD.

The relationship between the set of components, the DSD and a fiscal dataset with
three dimensions is shown schematically in Fig. 1. The upper middle part of the
diagram indicates the derivation of a new property expressing "classification of
the funding source" from a more general component of the OpenBudgets.eu model
expressing "economic classification".

The usability of the data model was verified both within the project itself and
in a teaching setting (a course focused on open linked data). Based on this
verification, a set of integrity constraints was formulated that catches frequent
modelling errors. The following errors in particular were identified:
[Figure 1 panels: OpenBudgets.eu model: component properties according to the DCV
vocabulary (dimensions, measures, attributes); data structure definition (DSD) of
the fiscal dataset; fiscal dataset]

Fig. 1: Relationship between the OpenBudgets.eu model, the DSD and the fiscal dataset

• direct use of abstract components (e.g. "classification") that are only meant to
be specialized by new components (e.g. "economic classification")

• a missing mandatory component (e.g. "fiscal period" or "currency")

• creation of a new component in the namespace of the core model (so-called
"namespace hijacking")

• creation of a component by subordinating it to an incorrect DCV type

• use of a custom code list for a dimension that already has a code list defined
(in such a case a new dimension should be introduced as a subproperty).
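Three of the checks listed above can be mimicked on a toy representation of a DSD. The project expresses its constraints in SPARQL over RDF; the Python sketch below only illustrates the logic. The namespace, the component names and the rule that a declared specialization satisfies a mandatory component are all assumptions made for this example, not the real OpenBudgets.eu identifiers or rules.

```python
# Placeholder namespace, not the real OpenBudgets.eu one.
CORE_NS = "http://example.org/core/"
ABSTRACT = {CORE_NS + "classification"}
MANDATORY = {CORE_NS + "fiscalPeriod", CORE_NS + "currency"}
CORE_COMPONENTS = ABSTRACT | MANDATORY | {
    CORE_NS + "economicClassification",
    CORE_NS + "amount",
}


def check_dsd(components):
    """Return error messages for a DSD given as dicts with 'uri' and 'subPropertyOf'."""
    errors = []
    for component in components:
        uri = component["uri"]
        if uri in ABSTRACT:
            errors.append(f"abstract component used directly: {uri}")
        if uri.startswith(CORE_NS) and uri not in CORE_COMPONENTS:
            errors.append(f"namespace hijacking: {uri}")
    # Assumption: a mandatory component counts as present if it is used directly
    # or if some component is declared as its specialization.
    covered = {c["uri"] for c in components} | {c.get("subPropertyOf") for c in components}
    for mandatory in sorted(MANDATORY):
        if mandatory not in covered:
            errors.append(f"missing mandatory component: {mandatory}")
    return errors


dsd = [
    {"uri": CORE_NS + "classification"},                  # abstract, used directly
    {"uri": CORE_NS + "myDimension"},                     # new term in the core namespace
    {"uri": "http://example.org/city/fundingSource",
     "subPropertyOf": CORE_NS + "economicClassification"},  # correct specialization
    {"uri": CORE_NS + "fiscalPeriod"},
    {"uri": CORE_NS + "amount"},
]
errors = check_dsd(dsd)
for message in errors:
    print(message)
```

The sample DSD triggers one error of each kind: the abstract "classification" component is used directly, "myDimension" is minted in the core namespace, and no currency component is present.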

3 Automatic data transformation

Processing takes place in transformation applications, in particular in the
LinkedPipes ETL framework¹ [2]. Data are extracted from primary sources (tables in
CSV, or structures in XML) and expressed using generic DCV components supplemented
with elements specific to fiscal data, including the code lists of the individual
dimensions; they are then validated and linked to external data. The processing is
described by structured processes (pipelines), which can be run repeatedly for
similar input datasets, monitored continuously and modified.

The integrity constraints mentioned above, expressed in the SPARQL query language,
were also implemented as part of the automatic transformation process. The task
currently being addressed is the use of so-called linking tools for automatically
creating links from data elements (in particular code-list values of dimensions) to
external datasets. LinkedPipes ETL already has such tools, e.g. Silk², incorporated
into its inventory of usable elements, as so-called Data Processing Units.


1 http://etl.linkedpipes.com/ 

2 http://silkframework.org/ 
The structure of the data model and the process of obtaining and using feedback
are described in more detail in a paper at the SEMANTiCS 2016 conference [3]. The
current state of support for creating pipelines in the LinkedPipes ETL environment
on the basis of so-called "pipeline fragments", including ones specific to the
OpenBudgets.eu model, can be found in a paper accepted to the SemStats 2016
workshop [4].

4 Conclusion

The OpenBudgets.eu project strives for systematic support of the creation of
linkable data in the area of public budgets and spending. Building on the initial
phase of the work, briefly summarized in this paper, demonstration automatic
analyses and visualizations are currently being carried out, and those that prove
to be more widely usable will subsequently be integrated into a software platform
maintained over the long term.

Acknowledgements: This publication was created with partial support from the EU
Horizon 2020 project no. 645833 (OpenBudgets.eu).
Annotation:

Modeling and transforming fiscal datasets using RDF technology in the
OpenBudgets.eu project

The paper addresses the topic of constructing linked datasets in the fiscal domain,
as addressed by the (Horizon 2020) OpenBudgets.eu project. The two aspects
concerned are: 1) the design of a data model based on the RDF-based Data Cube
Vocabulary (DCV) and on code lists in the SKOS format, and 2) the transformation of
source data (typically in the CSV format) to the target RDF format using
transformation applications (esp. the ETL framework LinkedPipes ETL). The
transformed data is also validated by a system of integrity constraints.



Graph database as a storage for data lineage metadata
- experiences and challenges


Karel Quast, Michal Valenta 

Faculty of Information Technology
Czech Technical University in Prague
Thákurova 9, 160 00 Praha 6, Czech Republic

{karel.quast, michal.valenta}@fit.cvut.cz 


Abstract. In a project dealing with so-called data lineage, we chose a graph
database rather than a relational one as the metadata store three years ago. Since
then the number of installations has grown, the volume of data has increased and
customer requirements have changed. We have dealt with the temporal dimension of
the store and with multiple possible views of the data hierarchy, i.e. adding, for
example, a logical/conceptual hierarchy alongside the physical hierarchy that is
derived from the analysis of the relevant data dictionaries. Graph databases appear
to be a promising solution for this type of task. In this paper we outline the
requirements on a metadata store for data lineage, share our own experience with a
practical implementation, and indicate further directions of development and the
challenges associated with them.


Contribution type: Paper on ongoing research

Keywords: graph database, temporal database, data lineage


1 Introduction

From the application (practical use) point of view, our work relates to the
rapidly developing field of data governance in the area of data warehouses.
Specifically, we deal with the design and implementation of a metadata store for
tracking so-called data lineage.

The entry for data lineage in the Encyclopedia of Database Systems [1] refers to
the term data provenance, which is introduced as follows: "The term "data
provenance" refers to a record trail that accounts for the origin of a piece of
data (in a database, document or repository) together with an explanation of how
and why it got to the present place." In our implementation of the metadata store
we are indeed able to track the origin of data, i.e. where (from which source) a
particular item comes from. We nevertheless prefer the term data lineage, because
it is more common in the area of data warehouses and of tools working over their
metadata, and it is intuitively easier to understand.

A formal introduction of the term data lineage, together with a basic two-level
categorization, can be found in the technical report [2]. The context for
introducing the term is defined as follows: we have a set of input data that
enters a (general) graph of transformations T_1, ..., T_n, from which a set of
output data O_1, ..., O_m emerges. In this context we then examine and describe
the individual transformations, i.e. the data lineage.

The report further offers two basic viewpoints for categorizing data lineage
problems. The first is the prevailing mode of querying: "where", when we ask mainly
about the path that data take through the system, or "how", when we put greater
emphasis on understanding how the data change during processing. The second
viewpoint introduces the categories "schema" and "instance", according to what is
being tracked. Many other works build on this classification, for example [3,4].

The rest of the paper is organized as follows: Section 2 describes the environment
in which we implement the store, Section 3 is devoted to the experience we have
gained with the implementation so far, and the final, fourth section outlines
further directions of research and experiments.

2 The Manta project and requirements on the metadata store

The metadata store discussed in this paper is the core of the Manta Tools¹ family
of tools. It is a product of the company Profinit intended for data lineage
visualization, data governance and SQL code analysis in the (heterogeneous)
environment of production databases and data warehouses of larger enterprises
(banks, telecommunications operators, etc.). ČVUT FIT contributes to its
development within a TAČR project.

In terms of the classification of data lineage problems given in the previous
section, most use cases of the Manta tools fall into the quadrant given by the
categories "where" (we emphasize displaying the line of data origin rather than
describing how the data change) and "schema" (we are interested in elements of the
(database) structure, not in particular values (instances)); hence a metadata
store.

From the Manta Tools family, we focus below on the Manta Flow tool. It works in
the following steps:

1. Analysis of the data dictionaries of all databases in the organization. The
result is hierarchies of objects, for example technology - database - schema -
table - column, technology - database - schema - package - procedure - parameter,
etc. We join the individual trees under a single umbrella node, which yields one
hierarchy (tree) that reflects the physical structure of data storage in the
organization.

2. Analysis of ETL processes and transformation scripts. Each such script adds
directed edges between the leaves of the tree. Each edge represents the fact that
the corresponding element turns into another one (a table column becomes a view
column, a view column becomes an input parameter of a procedure, a column of a
text (CSV) file becomes a table column, etc.).


1 https://mantatools.com/products/ 
3. The resulting (general) graph describes both the physical storage of data in the
databases and the data flows in the organization. This graph is the basis for
visualizing data flows at various levels (columns, schema objects, databases, ...)
and for further data-flow analyses (impact analyses, security audit, etc.).
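The three steps above can be pictured with a small in-memory graph: a tree for the physical hierarchy, directed edges between leaves for the data flows, and a forward traversal for impact analysis. This is a sketch under stated assumptions, not Manta Flow code; the class, its methods and all object names (Oracle, DWH, STAGE and so on) are hypothetical.

```python
from collections import defaultdict, deque


class LineageGraph:
    def __init__(self):
        self.parent = {}                # node path -> parent path (physical hierarchy)
        self.flows = defaultdict(set)   # source leaf -> target leaves (data flows)

    def add_object(self, *path):
        """Step 1: register a hierarchy path under the single umbrella node ROOT."""
        node = "ROOT"
        for part in path:
            child = f"{node}/{part}"
            self.parent[child] = node
            node = child
        return node

    def add_flow(self, source, target):
        """Step 2: add a directed edge found in an ETL process or script."""
        self.flows[source].add(target)

    def downstream(self, start):
        """Step 3: breadth-first search for every element reachable from start."""
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for target in self.flows.get(node, ()):
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
        return seen


graph = LineageGraph()
column = graph.add_object("Oracle", "DWH", "STAGE", "CUSTOMER", "NAME")
view_column = graph.add_object("Oracle", "DWH", "CORE", "V_CUSTOMER", "NAME")
parameter = graph.add_object("Oracle", "DWH", "CORE", "PKG_LOAD", "LOAD_CUSTOMER", "P_NAME")
graph.add_flow(column, view_column)
graph.add_flow(view_column, parameter)

impacted = graph.downstream(column)
# Lifting the result one level up the tree gives the flow at object level.
impacted_objects = {graph.parent[node] for node in impacted}
```

The last line shows why the hierarchy and the flows live in one graph: a column-level result can be aggregated to tables, schemas or databases simply by walking up the parent links.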

At the beginning of the project, the relational database engine PostgreSQL was
used for the metadata store. For a (currently) medium-sized deployment of the tool
(on the order of about 1 million nodes and 3 million edges) the response of the
database engine was still sufficient, but the amount of data was expected to grow.

The structure of the metadata store described above, a general directed graph, is
identical to the data model of so-called graph databases, an important member of
the diverse family of NoSQL databases, which have developed very rapidly over the
last 10 years and are seeking, and often finding, their use in specific application
domains. Not only the structure but also the typical querying (finding a path in
the graph, or following the data flow onward from a particular element) directly
invites the deployment of this type of database engine.

3 Deploying a graph database and further development of the store

In this section we first describe very briefly, in Section 3.1, our experience
with deploying graph databases, and then two modifications of the data store: in
Section 3.2 the addition of a temporal dimension, which makes it possible to track
the evolution of data flows over time, and in Section 3.3 the addition of further
views of the hierarchy of data objects.

3.1 Deploying a graph database

We started experimenting with the Neo4j database engine. We used a test database
with about 1 million nodes and 3 million edges. The storage space requirements of
Neo4j were about 25% higher than those of PostgreSQL, and they grew at roughly the
same rate as the database size increased. Data import behaved similarly (PostgreSQL
was on average about 10% faster).

As expected, the graph database won on more complex queries (a wider neighbourhood
of a node, possibly with subsequent filtering). It is probably not surprising
either that using the specialized query language Cypher, which offers very
convenient querying, was an order of magnitude slower than using the direct Java
API. In terms of the requirements on storage and processing speed, Neo4j proved
suitable. Detailed descriptions of the measurements and the results can be found
in [7].

Further experiments were carried out with the graph DB engine Titan. Compared with
Neo4j it has a very interesting feature: as in the case of MySQL, the data storage
backend can be chosen (at the time of the experiments the binary stores Persistit,
BerkeleyDB and Cassandra were available).

In testing, Titan showed results similar to those of Neo4j, and it was eventually
deployed in the project with the Persistit binary store.
3.2 Adding a temporal dimension

Further optimization of the chosen store focused on the use of specific indexes
suited to our usage. These experiments, as well as the addition of the temporal
dimension, are described in detail in the master's thesis [5]. Here we limit
ourselves to stating that we additionally use supporting (external) indexing by
ElasticSearch and that information about the temporal validity of objects is
maintained on the edges of the graph.
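Keeping validity on the edges means that a snapshot of the graph at a given moment is obtained by filtering edges. The sketch below assumes one possible encoding, a half-open interval [valid_from, valid_to) with None meaning "still valid"; the actual representation is described in [5] and may differ, and all element names are invented.

```python
from datetime import date


def edges_valid_at(edges, when):
    """Select the (source, target) pairs whose validity interval contains 'when'."""
    return [
        (source, target)
        for source, target, valid_from, valid_to in edges
        if valid_from <= when and (valid_to is None or when < valid_to)
    ]


# Hypothetical history: the view column was re-pointed to a new source column.
edges = [
    ("stage.customer.name", "core.v_customer.name", date(2014, 1, 1), date(2015, 6, 1)),
    ("stage.customer.full_name", "core.v_customer.name", date(2015, 6, 1), None),
]
snapshot_2015 = edges_valid_at(edges, date(2015, 1, 1))
snapshot_2016 = edges_valid_at(edges, date(2016, 1, 1))
```

Comparing two such snapshots shows how a data flow changed between two revisions of the analysed scripts.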


3.3 Adding further hierarchies

The analysis, design and implementation of multiple views (hierarchies) of the
data, including extensive measurements, are described in the master's thesis [6].
Here we only state that we chose the variant in which a different hierarchy is
expressed by means of an edge.

4 Further development of the store

In the future we plan to create a more comprehensive benchmark for testing this
type of store, to investigate further the options for efficient distribution of
the store (required by larger customers), and to parallelize the import of the
analysed transformation scripts.
Annotation:

Graph database as a storage for data lineage metadata - experiences and challenges

Three years ago, in a data lineage project, we chose a graph database rather than a
relational one as the metadata store. Since then, new product installations have
appeared, the amount of data has increased, and user requirements have changed. We
had to design our own solution for the temporal dimension of the store, and we are
working on a model with multiple views of the data hierarchy (i.e. allowing, for
example, a logical/conceptual hierarchy alongside the implicitly used physical
one). Graph databases appear to be a suitable DBMS for this kind of usage.

We present some implementation-specific details of our approach and share
experience with the particular software used in the project. We also outline the
challenges we currently face and the corresponding directions of research.
Abstract. Life events (life situations) are usually understood as a particular view
of the activities of public administration that is close to its clients, the
citizens. This view usually helps to organize the web platforms of public
institutions effectively and in a way that clients can understand. However, the
real significance of life events is much more fundamental. It is a view in which we
perceive the activities of public administration as consequences of actual events
in the real lives of its clients. The paper presents an approach to the analysis of
life situations within the modelling of the life cycles of public-administration
objects, and the use of this approach in a real ongoing project. Important related
issues are also discussed, such as the relationship of life events to processes in
public administration and their relationship to e-government, including
illustrations with examples from the project mentioned.


Contribution type: Research paper

Keywords: life situation, public administration, conceptual modelling, process
management.


1 Introduction

By public administration we mean the administration of public affairs, which
objectively follows from the need to care for values that go beyond the dimension
of the individual. This need stems from the fact that human nature is social; a
functioning society is the basic precondition, the environment and also the place
of personal fulfilment of every person. At the same time, one cannot rely on
personal interests and conduct automatically coinciding with the interests of
society as a whole. On the other hand, there are individual needs, problems and
situations that the person concerned is unable to resolve on their own, even though
they may be critically important, even fatal, for that person. All this determines
the need for, and the purpose of, the existence of public administration.

We divide the shared values that form the essence of the need for public
administration into three areas. By the physical environment we mean the given
territory and its natural and other physical values that need to be cared for. The
social environment represents people and their personal and societal, i.e.
cultural and other, values important for the functioning of society. By the
business environment we then mean all the values needed for exploiting
opportunities to realize values. In these three areas there is an objective need
for public action. The basic task of public administration is to act in the sense
of caring for values in the three basic dimensions of the life of society listed
above, which intertwine, influence and condition one another. In this conception,
public administration is thus not something given, defined and conserved once and
for all by legislation, regulations, or even technology, but a dynamic system of
care for a dynamically developing world of public affairs, in which every single
member of society has their own personal responsibility and, with it, the
associated rights. The capacity for dynamism in the conduct of public
administration is the main determinant of the dynamism of the development of
society as a whole. For public administration to be sufficiently dynamic and
adaptable to changing conditions¹, it must be built on the basic content elements
of life itself, which exist relatively independently of changes in the environment
and which fundamentally define the purpose of the activities of public
administration. We consider these basic elements to be the key events in the lives
of the individual actors and objects of society, the so-called life situations.

2 Conceptual analysis of the domain of public administration as a
basis for the analysis of life situations

The term life situation (or life event) has been used in public administration
roughly since the end of the 1990s. The original, and still essentially the only,
meaning in which the term is used relates to the creation of web portals of
public-administration organizations. Life situations originally represented an
unconventional view of public-administration "agendas": a view from the position of
the needs/problems of its client. Only in the British LEAP project [6], which no
longer exists, were life situations conceived in a way similar to ours, i.e. as the
basic elements of the life cycle of a public-administration object (with the
difference that there it concerned a single object, Citizen). Our approach is thus
essentially unique worldwide. In the project with the obscure title "Optimization
of life situations in relation to the register of rights and obligations" [2],
run at the Ministry of the Interior of the Czech Republic in 2015 within EU
development projects, we unexpectedly managed to push through the idea that a
necessary basis for a conception of public administration, especially in the spirit
of so-called eGovernment, where the maximum of routine public-administration
actions is to be taken over by technology, is an objective picture of life
situations, their basic list, and an awareness of their important relationships.
Such an objective picture must arise from a sufficiently exact analysis of the
origin of such situations. For this reason, a conceptual analysis of the
environment in which public administration is to act became the basis of the
project. Since this is in fact a general model of real life in the areas mentioned
above (see Figure 1), we call it the "Ontological model of the domain of public
administration".

The model is publicly available on the web [1] and is expressed in UML [5], with inspiration from its extension OntoUML [3] for modelling complex relationships between objects (see below). It uses the class diagram for the system-level, global view of real-world objects and the state chart for detailed views of the lives of the individual key objects. The choice of UML is motivated partly in general, by the fact that it is the standard modelling language of the informatics field 2, and partly specifically, by the object-oriented nature of UML, which allows not only an existential but also a developmental (life-cycle) view of objects; the latter proved to be crucially important in the ontology of the domain of interest of public administration.

1 This includes new technologies, which above all create possibilities to act in entirely different ways than the traditional one.

While the three basic areas of PA activity mentioned above are external, objectively given sources of life situations, the life cycle of an object gives these situations the subjective context of the individual concerned. Taking individual values into account is precisely one of the critical problems (and hence challenges) of today's public administration. The meaning of one and the same objective event is very different for different objects, precisely because it occurs in the different contexts of their individual lives. A single objective event, a life change (e.g. Reaching the age of majority), thus typically gives rise to a number of life situations with different meanings for the different objects affected (Parent, Descendant, School, ...). The purpose of modelling the life cycles of conceptual objects is therefore to express the general regularities of the context of their life situations and thereby to capture, in general terms, how the meaning of the same physical event differs for its various actors.

2.1 Modelling life situations

Besides the class diagram (see Fig. 1), which gives the system-level view of objects and their relationships, another key UML diagram was used to model the life cycle of an object as a system of life situations: the state chart (see Fig. 2).



Fig. 1 System model of the objects of interest of public administration (fragment)


2 Also because the project is part of the move towards so-called eGovernment and should therefore be able to serve as a basis for implementing IT applications in PA, or to be integrated with them.
The life situations themselves were then modelled as the basic elements of the life cycle description: the transitions between its individual states. By the definition of the UML state chart, each transition between states is described by a pair of items: an event (the external stimulus for the transition) and an action (the corresponding method from the life cycle of the object). The first item is the life event, while the second represents the link to the appropriate reactions of public administration to the given life event or situation (the link to the required PA processes). The examples given here are a simplified excerpt from the original model available at [1]. The model in Fig. 2 shows how the basic, generally valid temporal and causal regularities of the life events tied to one object are captured. The individual transitions between states delimit the necessary, or merely possible, temporal combinations of life events. For example, one can see that once a person has reached school age, that person will never again have the chance to be a child, or that the state Unemployed can be left only by getting a job or by reaching retirement age; in retirement, although a pensioner may be employed, the loss of employment, however real, is no longer a life event that is relevant from the point of view of public administration and requires a reaction. Fig. 2 also shows, among other things, that independent fatal events (here Death) cannot be modelled within the life cycle, since they are entirely independent of the other events (in practice, the transitions from all states to the terminal state, which this event makes possible in reality, are missing here). Taking this event into account requires an even higher abstraction, which trivializes the whole life of a person into a single state Alive, of which all the states shown here are parts. In our model this information is contained in the (equally trivial) life cycle of the Client of public administration.
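
The event/action reading of state transitions can be sketched in a few lines of Python. This is only an illustrative sketch: the state names, events and actions below are hypothetical simplifications inspired by the examples in the text, not the content of the model at [1].

```python
from collections import namedtuple

# One transition of a life cycle: a life event moves the object from
# `source` to `target` and triggers a public administration `action`.
Transition = namedtuple("Transition", "source event action target")

# Hypothetical, simplified fragment of a Natural Person life cycle.
TRANSITIONS = [
    Transition("Child", "reaches school age", "enrol in school", "Pupil"),
    Transition("Pupil", "finishes education", "register as job seeker", "Unemployed"),
    Transition("Unemployed", "gets a job", "close job seeker record", "Employed"),
    Transition("Employed", "loses job", "register as job seeker", "Unemployed"),
    Transition("Unemployed", "reaches retirement age", "grant pension", "Pensioner"),
    Transition("Employed", "reaches retirement age", "grant pension", "Pensioner"),
]


def fire(state, event):
    """Return (new_state, action), or None if the event is not relevant in this state."""
    for t in TRANSITIONS:
        if t.source == state and t.event == event:
            return t.target, t.action
    return None


def events_leaving(state):
    """Life events that can move the object out of the given state."""
    return {t.event for t in TRANSITIONS if t.source == state}


def reachable(state):
    """All states that can still be reached from the given state."""
    seen, stack = set(), [state]
    while stack:
        current = stack.pop()
        for t in TRANSITIONS:
            if t.source == current and t.target not in seen:
                seen.add(t.target)
                stack.append(t.target)
    return seen
```

In this sketch a life event that has no transition from the current state (such as losing a job in the state Pensioner) simply yields no public administration reaction, and `reachable("Pupil")` no longer contains `Child`.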



Fig. 2 Life cycle of the object Natural Person
Figures 1 and 2 also show how the two diagrams are used in connection with each other. Every transition between the states of an object always corresponds to some relationship to another object (an association, or membership in a generalization or aggregation structure). This brings an important methodological benefit of this way of modelling an ontology: studying life cycles, that is, the mutual causal and temporal dependencies of the individual events in an object's life cycle, is a powerful tool for discovering the necessary relationships between the objects of the model.


3 Conclusion: methodological consequences of the analysis of life situations

Apart from its considerable significance for society and for the conception of public administration, especially for the aims of so-called eGovernment, which are not the primary subject of this paper, the project had a considerable influence on methodological development in conceptual and ontological modelling. Above all, it turned out that for modelling life situations in the public administration sense, only a very small number of physical objects in the real world need to be regarded as basic (in Fig. 1 these are essentially just the two basic kinds of PA Client: Company (Společnost) and Natural Person). The vast majority of objects whose life cycles need to be modelled are abstract objects that represent different meanings of the basic objects (their roles) in different mutual contexts and from different points of view (see all the other objects in Fig. 1). It was necessary to overcome the conflict between the generalization and aggregation meanings of the same structure, which represents a specialization of meaning and, at the same time, the individual phases aggregated in the life cycle of the same object. For this purpose a number of structural patterns were created, built on the basic object stereotypes: Kind-Phase, Kind-Kind, Kind-Viewpoint-Phase and Relation-Viewpoint-Kind-Phase. They are described in the methodological documentation available at [2]. The patterns are loosely based on the OntoUML language [3] and extend it with the modelling of the dynamics (temporal aspects) of objects. The structural patterns also indicate a possible direction for a suitable extension of its meta-ontology UFO [4].

Figure 3 shows an example of the use of the Relation-Viewpoint-Kind-Phase pattern. It is the most complex of the patterns mentioned, is composed of the others, and is the one most often needed in the model. It is used for modelling relationships between objects (stereotype «relation»). Between the same two objects there typically exist a number of different relationships in parallel and, at the same time, up to several groups of mutually exclusive relationships. Each single relationship may need to be modelled by a life cycle (and, with few exceptions, probably will). Each relationship to another object usually represents a certain role that the modelled object plays with respect to the other one. An object is thus modelled as the sum of its parallel lives in the different roles it plays. The individual concurrent lives are either independent of each other, that is, mutually asynchronous (these are modelled as an aggregate of different points of view, «viewpoint»), or mutually exclusive (these are modelled as a plain specialization of the stereotype «kind»). The example in Fig. 3 shows the concurrently independent roles Heir and Employee (one can employ one's heir) and the mutually exclusive roles Parent, Descendant and Spouse (a parent cannot be the child of their own descendant, nor may they marry them). Any partial dependencies between lives (synchronization) are then modelled by a general association between the objects concerned (for example, if inheritance were conditional on marriage, or on its specific course).

In the future we plan, besides continuing to work on the content of the model, to validate it against the meta-ontology UFO [4] and thereby to start international cooperation on its further development.



Fig. 3 Use of the Relation-Viewpoint-Kind-Phase pattern
Annotation:

Life events as a basic starting point of eGovernment

Life events (life situations) are usually understood as a specific view of public administration activities that is close to the clients of public administration: the citizens. This view usually helps to organize the web platform of a public authority in an effective and client-friendly way. Nevertheless, the real meaning of life events is more fundamental: they allow public administration activities to be regarded as consequences of real events in the real lives of public administration clients. The paper introduces an approach to the analysis of life situations in the context of the life cycles of public administration objects and the use of this approach in a real ongoing project. The relation of life events to public administration processes, as well as their relation to eGovernment, is discussed and illustrated with examples from the project.




Exploration of common characteristics of ontologies by formal concept analysis


Ondrej Zamazal, Vojtech Svatek 

Faculty of Informatics and Statistics
University of Economics, Prague
nám. W. Churchilla 1938/4, 130 67 Praha 3, Czech Republic

{ondrej.zamazal, svatek}@vse.cz


Abstract. Knowledge ontologies are available on the web in various collections. The collections are used to select ontologies for testing semantic web tools. Some collections allow searching for ontologies by attributes, such as the number of classes, and by full-text keyword search. Individual ontologies can thus be selected to the extent to which the attributes reflect the requirements on the ontologies sought. There are also works that focus on aggregated statistics of ontology collections in order to make the collections distinguishable. While searching for ontologies by specifying attribute values runs into the limiting need for a clear idea of how to set these attributes, searching by aggregated descriptive statistics runs into the limiting generalization over the ontologies in a collection. In this paper we present a method for exploring common characteristics of ontologies by formal concept analysis. Formal concept analysis arranges sets of objects, with respect to their common characteristics, into formal concepts in a concept lattice. Exploring the concept lattice can thus support the selection of ontologies from collections by showing the connections between various combinations of attributes and the corresponding objects.


Contribution type: Work-in-progress paper

Keywords: ontologies, semantic web, formal concept analysis


1 Introduction

On the one hand, the number of knowledge ontologies on the semantic web keeps growing; on the other hand, new tools that use these ontologies keep appearing. New tools naturally need to have their functionality tested on diverse ontologies. Various ways of finding suitable ontologies have emerged for this purpose.

Knowledge ontologies can be found in classical ontology collections or by means of search engines. One of the best-known ontology search engines is Watson, 1 which gathers ontologies and other semantic documents from the web and allows keyword search over various aspects of ontologies, e.g. labels. Watson also offers a programming interface through which the values of some metrics, e.g. the numbers of ontology elements, can be obtained. Ontologies cannot, however, be searched by these metrics.

1 http://watson.kmi.open.ac.uk/

Besides search engines that collect ontologies freely from the web, there are ontology collections that concentrate on high-quality ontologies from one domain, e.g. BioPortal 2 for biomedicine, or on high-quality ontologies with a specific use, e.g. the ontologies from Linked Open Vocabularies (LOV) 3, which are used for describing linked data on the web. Both collections offer keyword search and provide basic characteristics of the ontologies. Neither of them, however, allows searching by these characteristics.

There are also tools that compute summary statistics of the ontologies in the original collections and make the collections available for further use. Matentzoglu et al. [2] presented a tool 4 for sharing and creating ontology collections. The tool contains summary statistics and provides access to the ontology collections BioPortal, Oxford Ontology Library and Tones, and to its own collection MOWLCorp. Assembling a custom collection is limited to filling in an "offline" HTML form with a few parameters. The possibility of searching for ontologies "online" and assembling test collections from them according to many (around 70) characteristics is offered by the "Online Ontology Set Picker" (OOSP) tool 5 [4].

While searching for ontologies by specifying attribute values runs, in practice, into the limiting need for a clear idea of how to set these attributes, searching by aggregated descriptive statistics runs into the limiting generalization over the ontologies in a collection.

In this paper we explore common characteristics of ontologies by formal concept analysis as a further way of making it easier to select ontologies from various collections. Formal concept analysis arranges sets of elements, with respect to their common characteristics, into formal concepts in a concept lattice. Exploring the concept lattice can thus support the selection of ontologies from collections by showing the connections between various combinations of attributes and the corresponding objects (ontologies). These connections can then be taken into account when setting attribute values in a search for the desired ontologies, in situations where one aims at ontologies that are richly covered by various attributes and are available in sufficient numbers.

2 http://bioportal.bioontology.org/

3 http://lov.okfn.org/

4 http://mowlrepo.cs.manchester.ac.uk/

5 http://owl.vse.cz:8080/OOSP/

2 Exploring ontologies with formal concept analysis

Formal concept analysis (FCA) [1] is one of the methods of exploratory analysis of tabular data. Its advantage is that it yields new, non-trivial findings about the input data. The output is a so-called concept lattice, a hierarchically ordered set of clusters, or formal concepts, derived from the input data. In its basic form FCA works with objects that have bivalent logical (yes/no) attributes. According to the presence or absence of attributes, the objects are divided into formal concepts, which can be understood as pairs (A, B), where A is a set of objects and B is a set of attributes that fall under the concept. In addition, A must be the set of all objects that have every attribute in B, and B must be the set of all attributes shared by every object in A. Object-attribute data with only yes/no values form a basic formal context. If many-valued attributes are needed, we speak of a many-valued context, which is converted into a basic context by conceptual scaling before FCA is applied.
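
The definition of a formal concept can be illustrated with a small brute-force sketch in Python. The toy context (four hypothetical ontologies and four yes/no attributes) and all function names are assumptions made for illustration; enumerating all attribute subsets is exponential and is not how dedicated FCA tools compute the lattice.

```python
from itertools import combinations


def extent(context, attrs):
    """All objects that have every attribute in attrs."""
    return frozenset(o for o, has in context.items() if attrs <= has)


def intent(context, objs, all_attrs):
    """All attributes shared by every object in objs."""
    shared = set(all_attrs)
    for o in objs:
        shared &= context[o]
    return frozenset(shared)


def concepts(context):
    """Enumerate all formal concepts (A, B) of a yes/no context."""
    all_attrs = frozenset().union(*context.values())
    found = set()
    for size in range(len(all_attrs) + 1):
        for attrs in combinations(sorted(all_attrs), size):
            a = extent(context, frozenset(attrs))
            b = intent(context, a, all_attrs)  # closure of attrs
            found.add((a, b))
    return found


# Hypothetical toy context: object -> set of attributes it has.
CONTEXT = {
    "onto1": {"few_labels", "few_range_classes"},
    "onto2": {"few_labels", "few_range_classes"},
    "onto3": {"many_labels", "many_axioms"},
    "onto4": {"few_labels", "many_axioms"},
}

if __name__ == "__main__":
    for a, b in sorted(concepts(CONTEXT), key=lambda c: -len(c[0])):
        print(sorted(a), sorted(b))
```

A minimum-support filter, as used later in the text, would simply keep the concepts whose extent A contains at least a given number of objects.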

In our case the ontologies are the objects and the attributes correspond to ontology metrics. In total we work with six groups of ontology metrics (e.g. metrics concerning entities). All attributes are many-valued, so before basic FCA can be applied, the many-valued context must first be converted into a basic one by conceptual scaling. For numeric attributes we build the scale by discretization into equal-frequency intervals. All attributes are then converted into bivalent variants. When generating concept lattices we specify the minimum support that each concept must have in the data.
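
Equal-frequency discretization followed by conversion to yes/no attributes could look roughly as follows. This is a minimal sketch under simplifying assumptions (simple quantile positions, intervals closed on the left, hypothetical function names); the scaling actually used in the experiments may differ in detail.

```python
from bisect import bisect_right


def equal_frequency_edges(values, max_bins):
    """Lower bounds of at most max_bins intervals holding roughly equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    edges = {ordered[0]}
    for i in range(1, max_bins):
        edges.add(ordered[i * n // max_bins])
    return sorted(edges)  # duplicates collapse, so there may be fewer bins


def to_binary_attribute(name, value, edges):
    """Name of the yes/no attribute (metric + interval) that holds for value.

    Assumes value is not smaller than the smallest edge, which is true
    when the edges were computed from the same data.
    """
    idx = bisect_right(edges, value) - 1
    low = edges[idx]
    if idx + 1 < len(edges):
        return f"{name} [{low},{edges[idx + 1]})"
    return f"{name} [{low},max]"


if __name__ == "__main__":
    # Hypothetical label counts of ten ontologies, with one outlier.
    label_counts = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
    edges = equal_frequency_edges(label_counts, 5)
    for count in label_counts:
        print(count, "->", to_binary_attribute("labels", count, edges))
```

Note how the outlier ends up in the rightmost interval together with an ordinary value, whereas equal-width intervals would leave almost all values in the first interval.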

We carried out the exploration on the ontologies from the LOV collection (509 ontologies), which is available through the OOSP tool. Based on testing, we arrived at a setting of at most 5 equal-frequency intervals and a minimum support of 50 ontologies. On the one hand, with a higher number of equal-frequency intervals the support in the data was insufficient and the resulting concept lattice had a flat structure. On the other hand, a lower support would mean too small an extent of the concepts found. With the chosen setting of at most 5 intervals, 167 bivalent attributes were created and the resulting concept lattice had 5 levels. An example of a concept found at the third level is {labels [7,20), range class [2,7), object property range [2,7)} (51), which covers 51 ontologies with a relatively small number of labels, of named classes in the ranges of object properties (range class), and of object properties with a defined range (object property range). These intervals cover lower values of the respective metrics, whereas most concepts in the lattice contain intervals with extreme values, such as this concept from the fourth level: {labels [104, 16878], range class [26, 2329], axiom [802, 44101], object property range [27, 2329]} (55). In this case the concept contains the same types of metrics (plus the number of axioms), but with the highest values in the attribute intervals. This concept points to another typical feature of the concepts found during exploration: the metrics in a concept are often closely related to one another.

The dominant presence of concepts with the highest interval values may be partly caused by the fact that ontologies with high values of certain metrics probably also have high values of related metrics (e.g. the number of axioms and the number of object properties with a defined range). Another factor may be some distortion introduced by discretization into equal-frequency intervals. In our testing, equal-frequency intervals gave more meaningful exploration results than equal-width intervals, because most metrics have a log-normal distribution with outliers. With equal-width intervals most ontologies fell into a few intervals with lower values, and these then dominated the generation of the concept lattice. With equal-frequency intervals the rightmost interval covers the "long tail", which increases the chance that several ontologies from a given concept are together in the long tail of other metrics as well, and this contributes to the dominance of the highest interval values in the concept lattice.

3 Conclusion

The initial exploration using FCA showed promising possibilities for finding ontologies that match common characteristics. Experimenting with the discretization method, which fundamentally affects the character of concept intents, deserves particular further attention. The resulting concepts represent a categorization of ontologies, which we plan to compare with the outputs of cluster analysis [3]. The future goal of this work is to let the user exploit the outputs of FCA, or of another data mining method, as support for assembling a test collection of ontologies.

Acknowledgement: Ondrej Zamazal was supported by the GACR grant 14-14076P.
Annotation:

Exploration of common characteristics of ontologies by Formal Concept Analysis

Designers of new semantic web tools search for ontologies in various ontology repositories. Repositories differ not only in the characteristics of their ontologies but also in the means by which a user can search for suitable ontologies. Some repositories provide access by specifying values of metrics; others offer full-text search using keywords from various aspects of ontologies. Other works aim at overall statistics of repositories. While searching for ontologies by specifying metric values is restricted by the fact that a user often has no idea of suitable metric values, using overall statistics of repositories is restrictive due to their generality. This paper deals with a method for exploring common characteristics of ontologies by Formal Concept Analysis. Formal Concept Analysis organizes a set of objects according to their common characteristics into concepts within a lattice. An exploration of the lattice might support the selection of ontologies from repositories.
Abstract. We define a social network as a multirelational data set that can be represented as a graph. In this paper we deal with methods for examining relationships between users in social networks. We compare the results obtained by a longitudinal analysis of a social network of pupils using three different methods. Finally, we formulate the problem of finding overlapping communities from the point of view of message spreading in the network.

Contribution type: Work-in-progress paper

Keywords: clustering, overlapping communities, cascades, matrix factorization


1 Introduction

From the data mining point of view, a social network can be characterized as a heterogeneous and multirelational data set represented by a graph. The nodes of the graph represent objects, and the edges represent relationships or interactions between these objects. Social networks can be considered not only in a social context; there are many instances of social networks in the world in the form of technological, business, economic or biological networks [3].

A community within a social network is usually defined as a group of nodes that are densely connected compared with the rest of the network. If every node in the network belongs to exactly one community, we speak of disjoint communities. Many real networks, however, are characterized by nodes that are members of more than one community, in which case we speak of overlapping communities [4].

In this paper we deal with methods for finding overlapping communities. In the first part we consider a longitudinal analysis of the pupils of a school class, where the overlapping communities are formed by pupils who are similar in terms of their relationships to other classmates. In the second part we analyse the problem of overlapping communities from the point of view of message spreading in the network.
2 Analysis of relationships between users

In cooperation with a grammar school in Košice we analysed clusters of pupils of a school class who are, in a certain sense, close to one another. The nodes of a weighted directed graph represent the pupils, and the edges carry integer values in the range from -3 to 3 that represent the relationship of the rating student to a classmate [6]. The analysis used the one-sided fuzzy approach to formal concept analysis [5], which makes it possible to generate overlapping communities. To reduce the number of overlapping communities, a modified Rice-Siff algorithm is used, which, by means of a distance function and metric properties, additionally allows these communities to be ranked by their significance. A method that ranks the significance of overlapping communities using the concept of so-called upper alpha-cuts from the theory of fuzzy sets and fuzzy logic is presented in [8]. In [1] we modified these ideas so that computations are also possible for so-called lower alpha-cuts.

Based on data collected about pupils in 2007, 2011 and 2014, we continued this research and evaluated three samples of pupils, not necessarily the same ones, against each other. For the analysis we used the modified Rice-Siff algorithm and the methods of upper and lower alpha-cuts, with which a teacher can get to know the structure of his or her class more closely. Each of these methods orders the communities from the most significant to the least significant. Using the Kendall tau-b coefficient [9] we tried to reveal correlations that are typical for the methods used. In contrast to the traditional Spearman correlation coefficient, the Kendall tau-b coefficient does not require the values to be sorted by size and assigned ranks. To compute the Kendall tau-b coefficient we used the Kendall package in R, but we also wrote our own Java class that computes the coefficient by definition, in order to confirm the correctness of the result.
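
Computing the coefficient by definition can be sketched as follows in Python (the Java class mentioned above is not shown in the paper, so the function below is only an assumed equivalent). It counts concordant and discordant pairs and corrects the denominator for ties, which is what distinguishes tau-b from tau-a.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall tau-b of two equally long sequences, computed pair by pair."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue              # tied in both variables
            if dx == 0:
                ties_x += 1           # tied only in x
            elif dy == 0:
                ties_y += 1           # tied only in y
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denominator = sqrt((concordant + discordant + ties_x)
                       * (concordant + discordant + ties_y))
    if denominator == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denominator


if __name__ == "__main__":
    # Hypothetical example: significance rank versus community size.
    significance = [1, 2, 2, 3]
    size = [1, 2, 3, 3]
    print(kendall_tau_b(significance, size))
```

The quadratic loop is adequate for the small numbers of communities considered here; library implementations use faster algorithms.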


Tab 5. Kendall tau-b coefficients between the significance and the size of communities, and between the significance and the popularity of communities


             2007                   2011                   2014
             size      popularity   size      popularity   size      popularity
Rice-Siff    0.590**   -0.206       0.673**   -0.295**     0.619**   -0.268
Upper cuts   0.130**   -0.045       -0.399**  0.333**      -0.071*   0.071*
Lower cuts   0.181**   0.196**      0.306**   0.203*       0.252**   0.181**


The table shows that, with the lower cuts method, the significance of communities and the popularity of the pupils in a community are positively correlated, significantly so in all three samples. This means that the lower cuts method first yields pupils who are less popular, and popularity then grows in the subsequent overlapping communities. In the Rice-Siff algorithm [5], by contrast, the significance of communities is strongly positively correlated with their cardinality. This means that the algorithm starts by generating small groups of pupils who are very close to one another, and the cardinality of the generated groups gradually increases.

Non-overlapping (mutually disjoint) communities in weighted directed graphs can also be found, for example, with the Infomap algorithm [7]. On the other hand, an efficient method for finding overlapping communities based on relationships between users in an undirected, unweighted graph is presented in [4]. Consider an undirected, unweighted graph and its adjacency matrix, which stores the information about which nodes in the graph are connected to each other. Using matrix factorization, such a matrix can be decomposed into two factor matrices whose product best approximates the original adjacency matrix. The chosen number of factors (fixed, or determined by a suitable optimization) corresponds to the number of overlapping communities found. The membership of a given node in the individual communities is determined by the values generated in the factor matrix.
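
The idea of reading community memberships from a factor matrix can be sketched with plain non-negative matrix factorization and multiplicative updates. This is a swapped-in, generic technique chosen for illustration; the method of [4] may use a different objective and optimization. The example graph and the choice to put ones on the diagonal are assumptions.

```python
import random


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def frobenius_error(a, w, h):
    """Squared Frobenius norm of A - WH."""
    wh = matmul(w, h)
    return sum((a[i][j] - wh[i][j]) ** 2
               for i in range(len(a)) for j in range(len(a[0])))


def nmf(a, k, iterations=200, seed=0):
    """Factorize A (n x m) into non-negative W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(a), len(a[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    eps = 1e-9  # avoids division by zero
    for _ in range(iterations):
        # H <- H * (W^T A) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, a), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (A H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(a, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


if __name__ == "__main__":
    # Two triangles {0,1,2} and {2,3,4} sharing node 2 (diagonal set to 1).
    adjacency = [[1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1],
                 [0, 0, 1, 1, 1],
                 [0, 0, 1, 1, 1]]
    w, _ = nmf(adjacency, 2)
    for node, row in enumerate(w):
        print(node, [round(v, 2) for v in row])
```

Each row of `W` holds the strengths with which a node belongs to the `k` communities; a node with several sizeable entries would be interpreted as lying in an overlap.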


3 Analysis of message spreading between users

Information spreads between users (viral marketing) by word of mouth, through recommendations to buy books or films, and also in the form of information cascades, e.g. on Twitter. We also speak of so-called social contagion, in which individuals tend to adapt to the behaviour of their peers [2].

To represent message spreading in a social network we need two graph structures. The first is a directed graph whose nodes are the users and whose directed edges indicate that information spreads from one user to another. The second is a weighted bipartite graph with two sets of nodes (the set of users U and the set of messages I). The value of each edge (u, i) between a user and a message expresses the time at which user u shared message i with his or her followers (e.g. a retweet on Twitter) [2].

The cascade of a message i is the sequence of pairs (u, t), where t is the time at which user u shared message i. From the set of cascades of all messages we can, using the methods described in [10], select only those messages that meet a certain threshold, for example on the average distance between the followers of a message, on a measure of entropy, and so on.
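
Deriving cascades from the weighted bipartite graph can be sketched as follows. The data are invented, the function names are hypothetical, and the selection by minimum cascade length is only a stand-in for the threshold criteria of [10], which are not reproduced here.

```python
from collections import defaultdict


def build_cascades(share_times):
    """Turn bipartite edges {(user, message): time} into cascades.

    Returns {message: [(user, time), ...]} with each list ordered by time.
    """
    by_message = defaultdict(list)
    for (user, message), time in share_times.items():
        by_message[message].append((user, time))
    return {message: sorted(pairs, key=lambda pair: pair[1])
            for message, pairs in by_message.items()}


def select_messages(cascades, min_length):
    """Keep messages whose cascade has at least min_length shares.

    Stand-in criterion; [10] uses other measures such as entropy.
    """
    return {message for message, pairs in cascades.items()
            if len(pairs) >= min_length}


if __name__ == "__main__":
    # Hypothetical edge values: time at which a user shared a message.
    edges = {
        ("anna", "m1"): 3,
        ("ben", "m1"): 1,
        ("cyril", "m1"): 2,
        ("anna", "m2"): 5,
    }
    cascades = build_cascades(edges)
    print(cascades["m1"])
    print(select_messages(cascades, min_length=2))
```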

Since there is a certain relationship between cascades and communities, our next goal is to investigate how communities can be identified from cascades. It is also natural that the boundary of a community should stop the spreading of an ordinary message and its cascade. The diagram in Fig. 1 shows the relationships described in this paper; the dashed lines represent the issues of our current interest and research.


[Figure: diagram relating RELATIONSHIPS BETWEEN USERS and MESSAGE SPREADING IN THE NETWORK to COMMUNITIES (overlapping, disjoint) and CASCADES. Other labels: Rice-Siff algorithm; Infomap; directed graph (pupils, Twitter); matrix factorization; undirected graph (FB, Google+); bipartite graph.]

Fig. 1 Scheme of the analysis of relationships and message spreading
4 Conclusion

In this paper we present an experiment in which we examine overlapping communities of pupils over several years. In the second part we give an introduction to the problem of message spreading in social networks, which we plan to complement in our further research with experiments and data visualization (e.g. in Gephi). Community detection is a broad topic; in practice, usually only methods with linear complexity are usable, since more complex methods do not scale to real social network data.

Acknowledgement: This work was supported by the Ministry of Education, Science, Research and Sport of the Slovak Republic (MŠVVaŠ SR) under the VEGA projects 1/0073/15 and 1/0475/14.
Annotation:

Messages spreading and relationships between users in social networks

A social network can be defined as a multirelational data set which is represented by a graph. In this contribution, we present methods for exploring the relationships between the users of a particular social network. We compare the results obtained in a longitudinal study of a social network of students by three different methods. We formulate the problem of finding overlapping communities with regard to the spread of information in social networks.
Abstract. This paper describes our system for detecting stance in online discussions. The goal is to identify whether the author of a comment is in favor of the given target or against it. Our approach is based on a maximum entropy classifier, which uses surface-level, sentiment and domain-specific features. The system was originally developed to detect stance in English tweets. We adapted it to process Czech news commentaries.

Contribution type: Work-in-progress paper 
1 Introduction

Stance detection has been defined as automatically detecting whether the author of a piece of text is in favor of the given target or against it. A third class covers the cases in which neither inference is likely. Stance detection can be viewed as a subtask of opinion mining and it stands next to sentiment analysis. The significant difference is that in sentiment analysis, systems determine whether a piece of text is positive, negative, or neutral, whereas in stance detection, systems are to determine the author's favorability towards a given target, and the target may not even be explicitly mentioned in the text. Moreover, the text may express a positive opinion about an entity contained in the text while one can still infer that the author is against the defined target (an entity or a topic). This makes the task more difficult than sentiment analysis, but it can often bring complementary information [3].

Many applications could benefit from automatic stance detection, including information retrieval, textual entailment, and text summarization, in particular opinion summarization.
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2 Task description 

2.1 Stance detection at SemEval 2016 

The system was originally created for the SemEval 2016 task Detecting Stance in
Tweets [5]. The task had two independent subtasks: supervised and weakly supervised.
The supervised subtask tested stance detection towards five targets (Atheism, Climate
Change is a Real Concern, Feminist Movement, Hillary Clinton and Legalization of
Abortion). Participants were provided with 2,814 labeled training tweets for the five
targets. In the weakly supervised subtask, there were no training data, but
participants could use a large number (around 70K) of tweets related to a single
target: Donald Trump. The goal was to classify tweets into three classes: IN FAVOR,
AGAINST and NONE. Performance was measured by the average F1-score on the FAVOR and
AGAINST classes.
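
The evaluation measure described above can be sketched as follows. This is a minimal
illustration, not the official SemEval scorer: the function names and the toy label
lists are assumptions made for this example. Note that the NONE class still influences
the score indirectly, through the false positives and false negatives of the other two
classes.

```python
def f1_for_class(gold, predicted, label):
    """F1-score of a single class, computed from paired gold/predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def stance_score(gold, predicted):
    """Average of the F1-scores on the FAVOR and AGAINST classes only."""
    favor = f1_for_class(gold, predicted, "FAVOR")
    against = f1_for_class(gold, predicted, "AGAINST")
    return (favor + against) / 2


# Toy data (hypothetical, for illustration only).
gold = ["FAVOR", "FAVOR", "AGAINST", "AGAINST", "NONE", "NONE"]
pred = ["FAVOR", "NONE", "AGAINST", "FAVOR", "NONE", "AGAINST"]
score = stance_score(gold, pred)
```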

There were 19 participating systems in the supervised subtask and 9 in the weakly
supervised subtask. Our system performed well for Abortion (2nd), Climate change
(3rd) and Hillary Clinton (4th). The overall rank was 9th. In the weakly supervised
subtask, we were ranked 4th; only the top system was significantly better. Official
results are summarized in Table 1.


Tab 1. Overall system performance on SemEval's Twitter data.

Topic                              Our system F1 (rank)   Overall F1 (rank)
Atheism                            .5788 (8)              .6342 (9)
Climate change is a real concern   .4690 (3)
Feminist movement                  .5182 (10)
Hillary Clinton                    .5982 (4)
Legalization of abortion           .6198 (2)
Donald Trump                       .4202 (4)              .4202 (4)


2.2 Adaptation to Czech 

We used the same system to detect stance in Czech news commentaries. We collected
1,560 comments from a Czech news server¹ related to two topics: "Miloš Zeman" (the
Czech president) and "Smoking ban in restaurants" (statistics in Table 2).

Consider the following example from the topic "Miloš Zeman".

Target: Miloš Zeman

Comment: „To je u Zemana běžné, že používá nepravdy! Viz Peroutka² ..."³


1 http://www.idnes.cz 

2 The president accused the famous journalist Ferdinand Peroutka (1895 - 1978) of
supporting Hitler.

3 Can be translated as: "It is common for Zeman to use untruths! See Peroutka."

Tab 2. Czech news commentaries data - statistics.

Topic                        In favor   Against   None   Total
Miloš Zeman                  180        170       300    750
Smoking ban in restaurants   170        250       390    810


The annotation was done by two annotators. Their agreement was fair (74%; kappa 0.61).
The agreement level forms an upper bound for system performance.
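
For readers unfamiliar with the two agreement figures, the sketch below shows how the
observed agreement and a kappa coefficient can be computed for two annotators. It
assumes Cohen's kappa, which the text does not state explicitly, and the label lists
are hypothetical toy data, not the actual annotations.

```python
from collections import Counter


def agreement_and_kappa(labels_a, labels_b):
    """Return (observed agreement, Cohen's kappa) for two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected by chance from each annotator's label distribution.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical annotations of eight comments.
annotator_1 = ["FAVOR", "AGAINST", "NONE", "NONE", "FAVOR", "AGAINST", "NONE", "NONE"]
annotator_2 = ["FAVOR", "AGAINST", "NONE", "FAVOR", "FAVOR", "NONE", "NONE", "NONE"]
observed, kappa = agreement_and_kappa(annotator_1, annotator_2)
```

Kappa is lower than the raw agreement because it discounts the matches that two
annotators would produce by chance given how often each of them uses every label.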

3 The approach overview 

We preprocessed the Czech commentaries by the same rules as in the original system
[3] (for example, all URLs were replaced by the keyword 'URL', links to images were
replaced by 'IMGURL', and only letters were preserved while all other characters were
removed). Moreover, we stemmed the texts with HPS, the High Precision Stemmer [2].
The system is based on a standard maximum entropy classifier [4], trained separately
for each topic, with the following features.
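
The preprocessing rules named above could look roughly like the sketch below. It is
only an illustration under assumptions: the regular expressions, the list of image
file extensions and the function name `preprocess` are not taken from the original
system, and stemming with HPS is left out.

```python
import re

# Assumption: an image link is a URL whose path ends in a common image extension.
IMG_RE = re.compile(r"https?://\S+\.(?:jpg|jpeg|png|gif)\b\S*", re.IGNORECASE)
URL_RE = re.compile(r"https?://\S+")
# Anything that is not a letter (digits, punctuation, underscores, whitespace).
NON_LETTER_RE = re.compile(r"[\W\d_]+")


def preprocess(text):
    """Replace links by keywords, keep only letters, and split into tokens."""
    text = IMG_RE.sub(" IMGURL ", text)   # image links first, they are URLs too
    text = URL_RE.sub(" URL ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return text.split()


tokens = preprocess("Viz http://example.com/a.png a http://www.idnes.cz !!! 2016")
```

Image links are handled before generic URLs so that the more specific keyword wins.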

It has been shown that unigrams perform quite well in this task [6]. Our model is
based on TF-IDF and uses the top 1000 words from the vocabulary. The rest of the
features can be turned on or off for each topic. Initial n-grams⁴, as shown in [1],
can be useful features. Our system supports initial unigrams to initial trigrams.
Another surface feature was the comment length in words after preprocessing. We also
used a resource borrowed from sentiment analysis, the Entity-centered sentiment
dictionaries (ECSD): dictionaries created mainly for the purpose of entity-related
polarity detection [7].
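
The two surface features, initial n-grams and comment length, can be sketched as
below. The feature naming scheme and the function name are assumptions made for this
example, and the initial n-grams are taken from the start of the whole token list
rather than from each sentence.

```python
def surface_features(tokens, max_n=3):
    """Initial unigram to trigram indicators plus the length in words."""
    features = {"length": len(tokens)}
    for n in range(1, max_n + 1):
        if len(tokens) >= n:
            # One binary indicator per initial n-gram, e.g. "initial_2=to je".
            features["initial_%d=%s" % (n, " ".join(tokens[:n]))] = 1
    return features


features = surface_features(["to", "je", "u", "zemana", "bezne"])
```

Such a feature dictionary can be merged with the TF-IDF unigram weights before it is
passed to the classifier.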

The original system [3] used more features, which could not be easily applied to
Czech commentaries. We do not work with tweets, so we could not use the set of
features generated from hashtags. We have not yet analyzed the influence of
part-of-speech (POS) tags. We did not identify strong candidates to build a
domain-specific dictionary as in [3]. Bigram features did not work in the tweet
analysis, so we did not use them in this work either. However, we plan to revisit the
influence of bigram, POS and domain-specific features.

4 Results 

Table 4 shows results on the Czech data. We used two evaluation measures. The first
one was used for the SemEval'16 evaluation: the average F1-score on the FAVOR and
AGAINST classes. The second one includes the NONE class as well. We used 10-fold
cross-validation to split the data into training and test sets.
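
A 10-fold split can be sketched as follows. The text does not say whether the folds
were shuffled or stratified by class, so this example simply assigns items to folds in
round-robin order; the function name is an assumption. The item count 810 is the size
of the "Smoking ban in restaurants" set from Table 2.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; each item is tested exactly once."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


splits = list(k_fold_indices(810, k=10))
```

The reported score is then typically the average of the per-fold scores.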


4 Initial n-grams are the first n words of the sentence.

Tab 4. System performance on Czech news commentaries.

Topic                        F1 (In favor/Against)   F1 (In favor/Against/None)
Miloš Zeman                  .4347                   .5204
Smoking ban in restaurants   .4562                   .5400


The results show that performance on the Czech data (.43 - .46) is significantly worse
than on the English tweet corpus (.47 - .62). This is mainly due to the lack of some
key features, such as hashtags or domain-specific features. Moreover, in the tweet
corpus the stance tends to lean in one direction (either FAVOR or AGAINST), while in
the Czech corpus the largest share of the comments is considered neutral (NONE).

5 Conclusion 

The paper describes a system originally created to participate in the Tweet Stance
Detection task at SemEval 2016 and subsequently used to detect stance in Czech news
commentaries. We observed worse performance than on the original English tweet
corpus. This is mainly due to the lack of some significant features, such as hashtags.
The current plan is to revisit the influence of bigram, POS and domain-specific
features.

Acknowledgment: This work was supported by grant no. SGS-2013-029 Advanced
computing and information systems and by project MediaGist, EU's FP7 People
Programme (Marie Curie Actions), no. 630786.

In this contribution we summarize our experience with developing and using the portal
sitit.cz - Social network of computer scientists in the regions of the Czech Republic,
built within OP VK (the Operational Programme Education for Competitiveness). After
three years of development, the portal is now in its fourth year of sustainability
operation.

We recall our original use cases [6]. The aim of OP VK in the Czech Republic is to
develop an educated society in order to strengthen the country's competitiveness by
modernizing the systems of initial, tertiary and further education, connecting them
into a comprehensive system of lifelong learning, and improving conditions in research
and development. The aim of the call's support area (support area 2.4 - Partnership
and networks) was to strengthen relations between tertiary education institutions,
research organizations, and entities from the private sector and public
administration. Given the call and an initial survey of interest in the regions, our
work focused on supporting cooperation between the academic, business and government
spheres using knowledge profiles.

We tried to obtain further support for extending the functionality of the portal,
which we planned to turn into an environment for software testing (typically online
user studies of recommender systems for web shops [5]).

Next we describe the shifts in our orientation. The first concerned offering the
portal to other domains. The implementation of our portal is suitable for experimental
use in any knowledge-intensive domain (it suffices to replace the XML profile files
[3, 4]). Another concerned language versions. The latest activities concern support
for simulating student start-up visions following the Lean startup methodology [7].
Lean startup is a method for developing businesses and products, first proposed in
2008 by Eric Ries. Based on his previous experience from work on a software project,
Ries claims that his method can shorten product development cycles by combining
experiments driven by business hypotheses, iterative product releases, and what he
calls validated learning. Ries claims that if a startup invests time into iteratively
building a product or service based on the requirements of its first customers (early
adopters), it can reduce the risk of failure. Recently, critical voices on the Lean
startup methodology have also appeared, e.g. [2]. The authors of [2] state that the
problem lies not in the principles of Lean startup but in their use as a universal
recipe for successful innovation. Simple solutions are tempting, but they are rarely
effective. The authors fell into this trap with their startup Gamevy. Their experience
is worth noting.

Our experiments with the Lean startup methodology began in teaching the courses
"Semantizace webu" (Web Semantization) and "Uživatelské preference" (User
Preferences) [1]. The social features of the portal are used to simulate user feedback
at various stages of the development of a student project. We cover only the work from
the vision to the business plan (no programming). Instead of a minimum viable product,
the task is to design a visual form of the process model of the user interface.
Students are encouraged to come up with visions that address key issues of the
subject. In web semantization, this is the automation of extracting and annotating
information from the web. In user preferences, it is learning them from user
reactions.

Acknowledgment: The portal was supported by the OP VK project no.
CZ.1.07/2.4.00/12.0039 and this work by project P-46.

Annotation:

SoSIReCR - a social network of computer scientists in regions of Czech Republic

In this extended abstract we summarize our experience with the development and use of
the portal sitit.cz, a social network of computer scientists in the regions of the
Czech Republic. We describe the original use case and the pivot of our orientation
towards supporting students' start-up visions.

Abstract. Social networks move today's society and are one of the main information
channels for most of the population. It is therefore interesting to follow and study
what happens on them. A wide range of emotions can be recognized in tweets; it can be
analyzed to determine which sentiment prevails for a given hashtag. This lets us
assess a potential security risk for particular persons or places. For testing we
chose #Brexit, and for three months we collected tweets containing this hashtag into a
database. Besides the searched hashtag, the tweets contained several other hashtags;
the most frequent related hashtags were #VoteLEAVE, #BrexitNoww and #Euref. The
sentiment value was around -830 points, so it is clear that Twitter users lean to one
side. We thus found that the tweets are mostly negative, with the negativity directed
at the EU, which is why the authors attach at least one of the mentioned hashtags to
their posts.

Contribution type: Work-in-progress paper

Keywords: tweet, sentiment analysis, #brexit, natural language processing


1 Introduction

Social networks move today's society and are one of the main information channels for
most of the population. It is therefore interesting to follow and study what happens
on them. A wide range of emotions can be recognized in tweets; it can be analyzed to
determine which sentiment prevails for a given hashtag. This lets us assess a
potential security risk for particular persons or places. In this paper we describe
one way of identifying a security risk from collected English-language tweets. In the
web application under preparation, the user will be able to enter any hashtag into a
search field, and our system will analyze tweets and retweets of Twitter users based
on the one-week history of that hashtag. The user will be able to restrict the range
of analyzed tweets from several days down to a few hours. The system can thus provide
a longer-term analysis as well as a current overview of what is happening on Twitter.
After the input data are analyzed, the user receives clear statistics on the analyzed
words, phrases and related hashtags, the percentage of retweets and, last but not
least, the sentiment prevailing for the hashtag. Since for some hashtags it is not
obvious what result we arrived at, we will also offer the user a preview of the 5 most
retweeted tweets. Thanks to the related hashtags, the tweet preview and the point
score, the user will have a good idea of whether the tweets are positive, negative or
neutral. We build indirectly on our previous research [2, 3].

2 Keyword extraction

We analyze tweets using natural language processing methods. The essential step is
keyword extraction, which lets us apply a dictionary-based approach and assign a point
score to the analyzed tweet. The Nielsen dictionary we use, see [1], contains 2477
words, each rated by strength of meaning on a scale from -5 (negative) to 5
(positive). The algorithm looks up the words of the tweet that are in the dictionary
and divides them into positive and negative ones. Based on this division, values are
then assigned to the words. Summing these values and dividing by the number of all
words gives the relevant average sentiment of the tweet. Tweets were processed
sequentially. First the algorithm passes over the whole tweet and filters out
characters that do not belong in the analysis (e.g. periods, commas, dashes, ...). The
tweet is split into individual words, which are looked up in the dictionary; if the
dictionary contains a word, sentiment points are assigned to it. The tweet as a whole
is scored once all of its words found in the dictionary have their point values. These
points are summed and a relative score (the score relative to the number of words) is
computed for the tweet. The relative score gives a relevant sentiment result. After a
tweet is processed, its text and score are stored in the database.
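
The scoring procedure described in this section can be sketched as below. The
five-entry lexicon is a placeholder with illustrative values on the -5 to 5 scale, not
the 2477-word dictionary from [1]; the function name and the exact set of filtered
characters are assumptions as well.

```python
import re

# Placeholder lexicon (illustrative values only, scale -5 .. 5).
LEXICON = {"good": 3, "win": 4, "bad": -3, "disaster": -2, "hate": -3}


def score_tweet(text, lexicon=LEXICON):
    """Return (sum of sentiment points, score relative to the number of words)."""
    # Filter out characters that do not belong in the analysis (punctuation).
    cleaned = re.sub(r"[^\w\s#]", " ", text.lower())
    words = cleaned.split()
    # Only words found in the dictionary receive points.
    total = sum(lexicon[word] for word in words if word in lexicon)
    relative = total / len(words) if words else 0.0
    return total, relative


total, relative = score_tweet("Brexit is a disaster - bad, bad idea! #Brexit")
```

Dividing by the number of all words, not only the scored ones, keeps long tweets with
a few emotional words from outweighing short, strongly worded ones.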

3 Case study - #Brexit

For testing we chose #Brexit, because a huge number of statements are posted under
this hashtag every day and society is interested in the topic. For three months we
collected tweets containing this hashtag into a database, and the collected data gave
us quite interesting results.


Tab 1. Sentiment values during the experiment

Date    Hashtag sentiment
29/04   -540
19/05   -610
15/06   -830
24/06    440
28/06    670
02/07    990

Tab 2. Number of tweets

Period                  Number of tweets
before the referendum   10799
after the referendum    10338


Tab 3. Share of retweets

Period                  Share of retweets
before the referendum   25.00 %
after the referendum    83.00 %


Tab 4. Number of scored words contained in the tweets

Period                  Number of scored words contained in the tweets
before the referendum   21416
after the referendum    19584


The number of analyzed tweets exceeded 10000 both before and after the referendum, see
Tab 2. We recorded roughly 25 % retweets before the referendum and approximately 83 %
after it, see Tab 3. Besides the searched hashtag, the tweets contained several other
hashtags; the most frequent related hashtags were #VoteLEAVE, #BrexitNoww and #Euref,
which already clearly indicates what result of the analysis we could expect. Table 4
shows the number of scored words.
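
The two statistics discussed above, related hashtags and the share of retweets, can be
computed along the following lines. This is a sketch under assumptions: the sample
tweets are invented, hashtags are compared case-insensitively, and a retweet is
recognized by the conventional "RT @" prefix, which the paper does not specify.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#\w+")


def related_hashtags(tweets, searched="#brexit"):
    """Count hashtags that co-occur with the searched one (once per tweet)."""
    counts = Counter()
    for text in tweets:
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        tags.discard(searched.lower())
        counts.update(tags)
    return counts


def retweet_share(tweets):
    """Percentage of tweets that are retweets (assumed to start with 'RT @')."""
    if not tweets:
        return 0.0
    retweets = sum(1 for text in tweets if text.startswith("RT @"))
    return 100.0 * retweets / len(tweets)


# Invented sample tweets, for illustration only.
tweets = [
    "RT @someone: Time to go #Brexit #VoteLEAVE",
    "Polling day tomorrow #Brexit #Euref",
    "#Brexit #VoteLEAVE #Euref",
    "Undecided about #Brexit",
]
counts = related_hashtags(tweets)
share = retweet_share(tweets)
```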

The sentiment value was around -830 points, so it is clear that Twitter users lean to
one side. The point score, however, has to be understood in the context of the
question asked. A negative value suggests that the sentiment is negative, but the
question is whether this means that Twitter users want Britain to leave the EU or are
against it. Hashtags, as well as a look at the most retweeted posts, can help us
decide. We thus found that the tweets are mostly negative, with the negativity
directed at the EU, which is why the authors attach at least one of the mentioned
hashtags to their posts.

[Chart: Development of #Brexit sentiment in the period 29.04.2016 - 01.07.2016]

Fig. 1 Development of sentiment before and after the referendum

Figure 1 shows a marked change in the sentiment value before and after the referendum.
The question we asked when evaluating the data does not differ, but the point of view
changes. Before the referendum, the negative sentiment had to be read not as a
reaction to #brexit but as a reaction to the EU. After the referendum we can return to
the original question and take the sentiment value as a reaction to #brexit and the
referendum that took place.

4 Conclusion

The main goal of this work was to carry out a larger case study analyzing tweets in
terms of overall sentiment and its development over time. We chose a very lively
hashtag that is also related to the security situation, mainly in Europe. The
evaluated results and the sharp change in the sentiment value in the critical turning
period after the referendum show the high sensitivity of the method and hence its
suitability for obtaining sentiment values across a large group of users.

Further research will focus primarily on linking different sources of sentiment
information from different domains for many interest groups, both for individuals and
for groups of users. We expect that the result of such an analysis will improve
sentiment analysis related to possible security risks. We will continue to carry out
this analysis for English-language text sources, and we will also extend it to Czech
and Slovak sources.

Acknowledgment: The research was supported by projects of the Technology Agency of the
Czech Republic - TACR-TFO1000091, and by grant SGS 2016/175, VSB-TU Ostrava.

Annotation:

Twitter and sentiment of #Brexit

The hashtag #Brexit was used thousands of times a day before and after the UK
referendum. History will answer whether the decision of the British people was good or
bad. Serious social network analysis could help us stay in touch with people's
sentiment and, with such knowledge, be prepared for the future. Social, economic and
security circumstances are linked and woven together.

We analyzed more than 10 thousand tweets over the 3 months before the referendum, and
one thousand tweets after it. The tweets contained more hashtags than the one
mentioned, such as #VoteLEAVE, #BrexitNoww, and #Euref; their meaning was clear, and
the sentiment strength was evident and increasing. So, social network analysis
correctly anticipated the coming decision. Responsible people, such as politicians and
bank headquarters, had an opportunity to do their job, no matter what the news and
tabloids were saying in their headlines. Such sentiment analysis could be useful in
the future, as it is serious. The case study and its evaluation are given here as
well.




Applications of intelligent knowledge technologies




Extraction of structured objects from web portals in a few clicks

Peter Gursky, Milan Verescak

Institute of Computer Science, Faculty of Science, Pavol Jozef Šafárik University in Košice
Jesenná 5, 040 11 Košice, Slovakia

peter.gursky@upjs.sk, mverescak@gmail.com


Abstract. In this application paper we present Exago, a browser plug-in that lets the
user annotate objects on a web portal with a few simple actions. In cooperation with
an application server, it is possible to start downloading the portal and subsequently
extracting the attributes of these objects into a relational database for further
processing.

Contribution type: Application paper

Keywords: annotation, extraction of structured data from the web, browser plug-in


1 Introduction

The Kapsa project [1] aims to create a product metasearch engine that would allow a
realistic comparison of products from online shops based on their properties and user
comments. Current metasearch engines obtain structured data from online shops through
private communication with these shops. In the Kapsa project we decided to extract
product information directly from the shops' web presentation, which makes it possible
to obtain data from a much larger number of shops and to provide a wider range of
products, a larger sample of product comments, and a comparison of prices and
conditions across more shops.

To achieve this goal, all relevant product data must be collected from online shops on
a regular basis. Since every online shop has its own design, we created an annotation
tool that makes annotation undemanding for the user. If we want to extract data from a
new portal, or if the design of a previously downloaded portal changes, a new set of
extraction rules can be created in a few minutes.

In this paper we present the annotation tool Exago, which helps to mark all relevant
parts of product detail pages (attributes, images, comments) and to specify rules for
the download and extraction server, which can then crawl the given online shop and
extract all of its products into a database.

Fig. 1. Example screen when annotating comments with the Exago tool.


2 The Exago tool

There are many web extraction systems (for a comparison see e.g. [2]). These systems
are divided into manual ones [3, 4], which require programming in some
pseudo-language; automatically constructed extractors [5, 6], which create an
extraction system based on complete manual annotation and extraction of several
example web pages; automatically constructed extractors with partial user support
[7, 8], which create an extraction system without the need for extraction examples;
and automatic extractors without user support [9, 10], which create extraction systems
by analyzing repeating patterns on web pages.

Exago is a system for creating an automatically constructed extractor with partial
user support. More precisely, it creates a set of rules that the server uses to
download and extract the annotated data. Exago is implemented as an add-on for the
Firefox browser, which brings several advantages. The tool is multi-platform and easy
to install. Most of the annotation can be done with mouse events alone, directly on
the web page being annotated.

The whole annotation can be carried out on a single example product of an online shop,
and the resulting rules can be used to download and extract the complete data of all
products of that shop. All annotated parts of the web page are immediately highlighted
in color directly on the page, as can be seen in Figure 1.

During annotation the user marks various types of objects that are typical for online
shop pages:

• Domain-independent attributes, such as price, name, or number of items in stock,
which have a stable position in the web presentation of all products, or which occur
as part of the URL, such as the domain or the product identifier

• The area of domain-dependent attributes, such as display diagonal, number of
revolutions, or volume, which are typically shown in a table or a list as pairs of
attribute name and value

• The list of comments and their attributes, which are a combination of the previous
two types

• Images

• Click-throughs (including AJAX calls), e.g. to a subpage with comments or to an
image gallery, if not all data are on a single product detail page

• Pagination, where several pages must be visited to find all values

• Rules for downloading the portal, which serve to trim the crawled part of the portal
and to identify detail pages

3 Data extraction

The server receives the annotation rules in JSON format and, besides the download and
extraction itself, also allows scheduling further downloads to keep the extracted data
up to date. This scheduling, as well as the monitoring and configuration of running
downloads, is done through a web interface.

The extraction of all products from the web is a coordination of two tools, a crawler
and an extractor. The crawler traverses all relevant pages of the online shop and,
when it comes across a page that satisfies the rules for a product detail page, it
runs the extractor for that product. The extractor extracts data into the database
based on rules defined by a combination of XPath and regular expressions. If the rules
also contain click-throughs and pagination, these are performed with the Selenium
tool, which can simulate user actions on the web and thus also allows loading content
through AJAX calls.
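
To make the idea of rules that combine XPath and regular expressions concrete, here is
a small sketch. The JSON rule layout, the attribute names and the sample page are
hypothetical and do not reflect the actual rule format produced by Exago. The example
uses the limited XPath subset of Python's standard `xml.etree.ElementTree`, which
requires well-formed markup; a real extractor would need a tolerant HTML parser.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical rules: an XPath locates the element, a regex cleans its text.
RULES_JSON = """
{
  "name":  {"xpath": ".//h1[@class='title']", "regex": "(.+)"},
  "price": {"xpath": ".//span[@class='price']", "regex": "([0-9]+(?:[.,][0-9]+)?)"}
}
"""

# Hypothetical, well-formed product detail page.
PAGE = """
<html><body>
  <h1 class="title">Example phone</h1>
  <span class="price">Price: 199,90 EUR</span>
</body></html>
"""


def extract(page, rules):
    """Apply every rule to the page and return one record of attribute values."""
    root = ET.fromstring(page.strip())
    record = {}
    for attribute, rule in rules.items():
        node = root.find(rule["xpath"])
        if node is None:
            record[attribute] = None
            continue
        text = "".join(node.itertext()).strip()
        match = re.search(rule["regex"], text)
        record[attribute] = match.group(1) if match else None
    return record


record = extract(PAGE, json.loads(RULES_JSON))
```

Keeping the rules as data rather than code is what allows one annotated example page
to drive the extraction of every other product page with the same layout.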

4 Conclusion

Some of the methods for deriving rules from mouse events that Exago uses have already
been described in [11]. Further methods that would automate the annotation even more
are the goal of our future research.

Exago was created to ease annotation and data extraction from arbitrary online shops.
However, it can also be used to annotate and extract other lists of objects from web
portals that are represented by their attributes.

Acknowledgment: This work was supported by MSVVaS SR within the project VEGA 1/0073/15.

Annotation:

Extraction of structured objects from web portals in a few clicks.

The paper presents the annotation tool Exago combined with a server for crawling and
extracting products from e-shops. With Exago, an add-on for Firefox, a user can
annotate a product detail page in the web browser mostly by clicks. The tool allows
annotating attributes, images and comments, and can deal with the clicks and
paginations needed to gain all relevant product data. It is also a configuration tool
for crawling, used to reduce the number of pages to crawl and to identify the product
detail pages. The server extracts the data of all products from the annotated e-shop
into a relational database in structured form. Any required clicks are performed by
Selenium, which can handle regular links as well as AJAX calls. Exago was created
mainly to extract e-shop product data, but it can also be used to extract other lists
of objects with a similar object-attribute structure.

Abstract. This paper deals with creating a web application for data visualization
using JavaScript. The client part of the application is implemented in the Angular
framework from Google and the server part is written in PHP. In the paper we compare
the existing options for data visualization in web technologies and compare the most
widely used JavaScript frameworks and libraries: Angular, React and jQuery. These
technologies are compared both in terms of implementation difficulty and in terms of
performance. Finally, we also provide a guide for converting a plugin from the older
jQuery library to the newer Angular.

Contribution type: Application paper

Keywords: data, visualization, JavaScript, Angular, charts


1 Introduction

As the amount of data grows, so does the need to visualize these data properly.
Visualization helps us see relationships in the data and gives a quick overview of a
situation that would be very hard to obtain from the bulky data itself. In this paper
we focus on data visualization on the web using the modern Angular framework.

Angular [1] is a JavaScript framework (hence also the name AngularJS) developed by
Google. We briefly show how to visualize data with it. The comparison of Angular with
the React [3] and jQuery [4] libraries is a much-discussed topic, so we compare them
here on concrete data visualization examples. We show the pros and cons of the
individual technologies and describe in which projects each of them is suitable.

2 AngularJS and other technologies

Before we start implementing the visualization application in Angular, we must choose
the right technology for displaying data. In our work we use various kinds of data
visualization on the web. Here we show (see Table 1) what they are suitable for and
what limitations they have. Specifically, we look at HTML5 canvas, SVG and HTML
elements.


Tab 6. Comparison of technologies for data visualization.

                                               canvas   SVG   HTML and CSS
Drawing a bar chart                            yes      yes   yes
Drawing a pie chart                            yes      yes   no
Drawing a line chart                           yes      yes   no
JS events can be bound to the drawn elements   no       yes   yes
Changing chart colors with CSS                 no       yes   yes
Basic drawing does not require JavaScript      no       yes   yes
Responsiveness                                 no       yes   yes


At first glance, the most striking feature of Angular is its templating system.
However, Angular offers much more, including a large number of ready-made services and
ways to extend our application. Another advantage of Angular is that it gives us the
possibility of testing from the very beginning. One could therefore say that Angular
leads us to good application development habits from the start. jQuery is a library
that, among other things, makes it possible to change and render HTML in a page, react
to events and create animations. React is a library that focuses on rendering and
changing HTML. An overview of some properties of Angular and its two alternatives is
given in Table 2.


Tab 7. Comparison of some properties of Angular, React and jQuery.

                                    Angular     React     jQuery
Working with the HTML DOM           yes         yes       yes
Animations                          yes         yes       yes
Binding data to the rendered HTML   yes         yes       no
HTML templates                      yes         yes       no
AJAX                                yes         no        yes
RESTful API                         yes         no        no
MVC architecture support            yes         no        no
Year of release                     2009        2011      2005
Type                                framework   library   library

3 Results

In one of the performance tests of data visualization technologies we focused on
measuring time directly in the browser. The resulting average times of several phases
over five renderings of a bar chart are shown in Fig. 1. The test turned out as
expected. jQuery is the fastest at script execution. Besides dependency checking,
Angular and React additionally compare the data and render only the items that have
changed. jQuery, however, is expensive in the rendering itself, because it also
redraws the bars whose values have not changed. In the total times, Angular is then
the fastest. React, on the other hand, is the slowest. This may be caused by the fact
that the React chart is connected to the application through the ngReact directive,
which makes it possible to combine the two technologies. ngReact may thus affect the
script execution time of the chart.
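
The difference between redrawing everything and redrawing only changed items can be
illustrated with a short language-neutral sketch, written here in Python rather than
JavaScript. It is not code from any of the three libraries; the function names and the
sample values are assumptions, and real frameworks diff far richer structures than a
list of numbers.

```python
def full_redraw(old_values, new_values):
    """Indices of bars repainted when every bar is redrawn on each update."""
    return list(range(len(new_values)))


def diffed_redraw(old_values, new_values):
    """Indices of bars repainted when only changed or new bars are redrawn."""
    return [
        index
        for index, value in enumerate(new_values)
        if index >= len(old_values) or old_values[index] != value
    ]


old = [4, 8, 15, 16, 23, 42]
new = [4, 8, 16, 16, 23, 40]
repainted_all = full_redraw(old, new)
repainted_changed = diffed_redraw(old, new)
```

The comparison step costs some script time, which matches the observation that the
diffing approach executes more script but spends less time on rendering.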



Fig. 1. Comparison of jQuery, Angular and React at the browser level.

In this work we also partly dealt with rewriting existing plugins from the older
jQuery technology to the newer Angular. (An example of such a plugin is one that
creates a line chart from existing data.) There are several ways to proceed (see
Table 3). All these variants differ and affect the resulting effort and the quality of
the rewritten solution.


Tab 8. Comparison of the individual approaches to rewriting a plugin from jQuery to Angular.

Approach: Gradual rewriting
  Advantages: The program is runnable at every stage. We do not need to know all
  Angular features in advance.
  Disadvantages: The resulting code usually depends on jQuery more than it would have
  to.

Approach: Adding functionality to a directive created in advance
  Advantages: The program is runnable at every stage. We are less dependent on the
  jQuery library.
  Disadvantages: Higher demands on comprehensive knowledge of Angular.

Approach: Taking only the mathematical formulas from jQuery
  Advantages: We do not have to depend on the jQuery library at all. This solution has
  the greatest potential to use the best practices of Angular code.
  Disadvantages: High difficulty. Requires comprehensive knowledge of Angular and of
  the procedures for creating directives.

4 Conclusion

The aim of the work described in this paper was to create a set of modules for data
visualization in Angular. We show which technologies can be used for visualization on
the web and compare three different libraries. We also outlined how one of the
existing plugins can be rewritten to Angular. Angular 2 is currently being developed,
but we built the application in Angular 1 because Angular 2 is still in beta. We
published the source code of the application on GitHub [2] and released it under the
most widely used licenses, GPL and MIT.

Although Angular was the fastest in the browser chart-rendering tests, it must also be
taken into account that this is a client-side application, which has to be downloaded
into the browser before it runs. The size (number of characters) of the executed script
therefore matters. In this respect Angular ranks behind React, and the jQuery plugin
fared worst: because it had the largest number of characters, its download into the
browser took the longest. According to our experiments, then, jQuery is the better
choice for a server-oriented application, that is, one where most of the work,
including HTML rendering, is done on the server and JavaScript is used only for small
page adjustments such as validation and animation. If, however, we are building a
single-page application, that is, one in which the user stays on a single page most of
the time, it is better to use the newer Angular technology, whose performance is
nevertheless comparable to that of React.

Acknowledgement: This publication was supported by project LO1506 of the Ministry of
Education, Youth and Sports of the Czech Republic.

Annotation:

Data Visualization Using AngularJS 

This paper deals with the creation of a data visualization web application using JavaScript. The
client-side application is implemented in the Angular framework from Google and the server side
is developed in the PHP programming language. The article compares the existing possibilities of
data visualization in the field of web technologies, as well as the most widely used JavaScript
frameworks and libraries: Angular, React, and jQuery. We assess these technologies from the
perspective of implementation requirements as well as performance. Finally, we discuss how
plugins from the older jQuery library can be converted to the newer Angular framework.
Abstract. Information about the user's current emotion is valuable feedback that can be
used for the adaptive behaviour of applications as well as for post-hoc analysis of
their usability. For reliable emotion detection, however, we usually have to rely on
special hardware, which has low penetration and is intrusive. Using cameras appears to
be a compromise, but affordable EEG sensors are promising as well. In this paper we
present two studies that aim to compare existing approaches to measuring users'
emotions.

Paper type: Research paper

Keywords: emotions, face recognition, EEG, user behaviour


1 Introduction: An Estimate of the User's Emotions Is Priceless

A reliable estimate of the user's current emotional state is valuable information. An
important area of its use is adaptation and personalization in intelligent systems. In
response to information about the emotional state, it would be possible to adapt the
content on a social network or to adjust the difficulty of tasks in an educational
system. Another area of use is the evaluation of software usability: in user studies,
the occurrence of an emotion, especially a negative one, can quickly point us to
problematic places in interfaces and scenarios.

Automatic emotion measurement is available today as a product, in the form of various
devices and software. At the same time it remains a research topic, because existing
approaches and solutions are not ideal in terms of accuracy, non-invasiveness and
availability. These three properties pull against one another, and strengthening one
means concessions in another. One extreme consists of approaches that estimate emotions
from traditional peripheral devices (keyboard, mouse) [6] and from estimates of the
semantics of user actions [5, 7]. Since they require no specialised hardware, they are
highly available and non-invasive, but also very inaccurate and not always applicable.
The opposite extreme consists of accurate but invasive wearable sensors of human
physiology (monitors of breathing, heart rate, skin conductance) [8], which also have
the drawback of low penetration for larger-scale studies. The electroencephalograph
(EEG), which senses electrical signals from the brain, also belongs to this category of
devices. Face recognition with cameras, particularly depth cameras, appears to be a
compromise across all three properties. Cameras are not invasive, ordinary web cameras
often suffice, and even special depth cameras (such as Kinect or Creative Senz3D¹) have
a certain penetration, mainly thanks to the gaming industry. On the other hand, they
are generally more sensitive to external factors such as unsuitable lighting or glasses
worn by participants. With the arrival of the Epoc EEG sensors from Emotiv² (and
others, e.g. from Neurosky³), which are less invasive than classical EEG, require less
handling and are affordable, we can speak of an attempt to bring EEG sensors into this
third category of devices. Their accuracy, however, and thus the possibility of using
them reliably for detecting and measuring emotions, remains in question.

We also pay attention to methods for measuring a person's emotional state as part of
the activities in the laboratories of the User Experience and Interaction Research
Centre (UXI@FIIT, http://uxi.sk). We have devices available (together with their
supporting software) that can be used for measuring emotional state: depth cameras, EEG
sensors, eye trackers and physiological sensors. The aim of the studies presented in
this paper is to verify experimentally the quality of emotion measurement with the
available devices and to assess the possibilities of using them for tasks related to
adaptation and usability evaluation.

2 Study 1: Solutions Based on Face Recognition

In the first study we focused on approaches to obtaining emotions that are based on
recognizing facial expressions. It was a qualitative study whose aim was to explore and
compare the capabilities of two tools: Noldus FaceReader⁴ and Shore⁵ (Fraunhofer IIS).
We examined which options for analysis and automatic evaluation they offer and how they
cope with negative factors.

In a web browser we showed the study participants a series of 35 images selected from
an annotated data set [2]. This data set originally contained 730 images, each of which
was assigned a valence and an arousal value. We showed each of the 35 images to the
participant for 7 seconds, during which the participant was to decide whether it
affected them positively (1) or negatively (-1), on a continuous scale in the interval
from -1 to 1. Throughout the experiment we also recorded the participants' faces.

The Noldus FaceReader tool provides analysis and visualization of emotions from the
human face. In our experiments we found that the tool can analyse images, recorded
videos and a live stream from the computer's camera in a relatively short time. Its
basic features included the recognition of six emotional states (joy, sadness, anger,
surprise, fear and disgust) plus a neutral state. Among the advanced options we can
highlight the recognition of basic participant characteristics (age, gender,
ethnicity), detailed facial features (open/closed eyes and mouth, presence of a beard,
moustache and glasses) and also the position of the face. An advantage of the tool is
that it can recognize emotions on the basis of four face models: general, children's
faces, faces of East Asian people and faces of elderly people. The tool can be made
more accurate through optional calibration and also copes with poorer lighting
conditions, although with backlighting we observed problems for people wearing glasses.

1 http://us.creative.com/p/web-cameras/creative-senz3d

2 http://emotiv.com/

3 http://neurosky.com/

4 http://www.noldus.com/human-behavior-research/products/facereader

5 http://www.iis.fraunhofer.de/en/ff/bsy/tech/bildanalyse/shore-gesichtsdetektion.html

A similar tool, intended mainly for commercial purposes, was Shore (from Fraunhofer
IIS), for which we had only a demo version available. In terms of quality, however, the
two programs were very evenly matched. An advantage of Shore was real-time emotion
analysis. Besides recognizing emotion, it could detect faces of various sizes and
rotations and recognize eyes, mouth, gender and age. These options were, however,
available separately rather than as an integrated whole, as they are in Noldus
FaceReader. Another asset of FaceReader was its better continuous calibration of
participants. In addition, Shore did not allow the detected emotion to be expressed as
a percentage.

3 Study 2: A Solution Based on EEG

In the second study we focused on recognizing emotions with an electroencephalograph
(EEG) using a method we designed, which uses support vector machines (SVM) to train a
classifier that determines one of seven emotions (joy, sadness, disgust, anger, fear,
surprise and the neutral emotion, i.e. the same ones as Noldus FaceReader). The
features used for classification are the power of the alpha and beta waves computed
from the measured values of the electrical signal, the valence and arousal values [1],
and their extreme and mean values for the given stimulus.
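The band-power features mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the band limits (8-13 Hz for alpha, 13-30 Hz for beta), the 128 Hz sampling rate, the naive DFT and the function names `band_power` and `stimulus_features` are all assumptions made for illustration, and the valence and arousal features derived according to [1] are not covered.

```python
import cmath
import math


def band_power(signal, fs, f_lo, f_hi):
    """Mean spectral power of `signal` in the band [f_lo, f_hi) Hz (naive DFT)."""
    n = len(signal)
    mean = sum(signal) / n
    centred = [x - mean for x in signal]  # remove the DC offset
    powers = []
    for k in range(1, n // 2 + 1):
        freq = k * fs / n  # frequency of DFT bin k in Hz
        if f_lo <= freq < f_hi:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(centred))
            powers.append(abs(coeff) ** 2 / n)
    return sum(powers) / len(powers) if powers else 0.0


def stimulus_features(signal, fs=128):
    """Alpha and beta band power for one stimulus (assumed band limits)."""
    return {
        "alpha": band_power(signal, fs, 8.0, 13.0),
        "beta": band_power(signal, fs, 13.0, 30.0),
    }


# Synthetic one-second signal dominated by a 10 Hz (alpha-range) oscillation.
fs = 128
demo = [math.sin(2 * math.pi * 10 * t / fs) + 0.2 * math.sin(2 * math.pi * 20 * t / fs)
        for t in range(fs)]
features = stimulus_features(demo, fs)
```

For real recordings an FFT-based estimate (for example Welch's method) would normally replace the naive DFT; the sketch only shows what kind of number ends up in the feature vector.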

To verify our method, as well as the accuracy and reliability of the Epoc EEG sensor
from Emotiv, we carried out a study based on previous work in this area [4]. The study
consisted of 20 one-minute segments of music videos, each intended to evoke one
dominant emotion in the participants. Most of the videos were taken from [4]. To
orchestrate the experiment we used Tobii Studio, which allowed us to display the videos
and the questions belonging to them in a chosen order; we did not evaluate the
eye-tracking data obtained by Tobii Studio in this study. In addition, we recorded the
participants' faces with a Creative Senz3D camera and used the recordings to determine
the emotion through facial expression recognition with FaceReader.

Before each video a black screen with a white fixation cross in the middle was shown
for five seconds, and the participant was to look at the cross. Each one-minute video
was followed by a questionnaire consisting of three questions: (i) "How strong was the
emotion you felt?", (ii) "How positive was the emotion you felt?", (iii) "Which emotion
was most dominant for you?". Participants answered the first two questions by choosing
a value from 1 to 10; the chosen values indicated the participant's subjective rating
of arousal (question i) and valence (question ii). For the last question they chose one
of the seven emotions listed above.

We carried out the study in the UXI research centre at FIIT with 9 participants; one
session lasted approximately 40 minutes. In total we thus collected a data set of 180
emotion-rated videos, which we used for the experimental verification of our method.
Using 5-fold cross-validation we achieved a mean accuracy of 58% in determining the
correct emotion on the test set, with a standard deviation of ±6%. We had to deal with
the imbalance of the data samples, since some emotions (e.g. anger or fear) were
represented only minimally in the data set; for this purpose we used oversampling
during the training phase. We also compared the results with emotion determination
through facial expression recognition using Noldus FaceReader. For a given moment in
time this tool does not determine one specific emotion but a ratio of emotions. When
counting only the dominant emotion we achieved an accuracy of 19%, which is markedly
lower than with the EEG sensor.
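The evaluation protocol described above (5-fold cross-validation with oversampling applied only during training) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the label distribution is invented, the helper names `k_fold_splits` and `oversample` are hypothetical, and the SVM training step is left as a comment.

```python
import random
from collections import Counter, defaultdict


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def oversample(indices, labels, seed=0):
    """Duplicate minority-class indices at random until all classes are equally frequent."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced


# Hypothetical, imbalanced labels for 180 rated videos (not the study's data).
labels = ["joy"] * 90 + ["sadness"] * 60 + ["anger"] * 18 + ["fear"] * 12
for train_idx, test_idx in k_fold_splits(len(labels), k=5):
    balanced_train = oversample(train_idx, labels)
    counts = Counter(labels[i] for i in balanced_train)
    # A classifier (an SVM in the paper) would be fitted on balanced_train
    # and scored on the untouched test_idx here.
```

Oversampling only the training indices keeps duplicated samples out of the test fold, so the reported accuracy is not inflated by copies of training examples.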

4 Readiness of the Technologies for Use in Research and Practice

Although the first study was only qualitative and aimed at exploring the capabilities
of existing tools, the emotion recognition results for the presented stimuli (images)
were encouraging, even if a larger experiment would be needed to confirm them. On the
other hand, in the task of recognizing emotions while watching music videos, the
approach based on detecting facial expressions achieved substantially worse results
than EEG. This may be caused both by the use of the EEG sensor itself, which may have
induced a certain stiffness in the participants, and by the nature of the presented
videos, for which the evoked emotion was probably weaker. In the future it would
therefore probably be good to take into account not only the dominant emotion but also
the others detected from the facial expression, since with natural stimuli too (those
not aimed at evoking a specific emotion, e.g. an application the user works with) the
expressed emotion can be expected to be weaker.

For EEG sensors it turned out that, although they are now affordable, they still
require a relatively lengthy preparation phase (e.g. moistening the electrodes), which
so far prevents their wider adoption in this area. We also carried out a small
experiment (with three participants) using the simpler Insight device from Emotiv,
which should overcome this barrier (at the cost of fewer electrodes), but its
reliability and accuracy proved insufficient for this type of task. The future thus
probably lies in further development of sensors so that they are as non-invasive as
possible and at the same time sufficiently reliable and accurate; here the eye tracker
appears interesting. It can measure pupil size, which indicates a person's emotional
arousal [3]. Another promising direction is the combination of different approaches.

Acknowledgement: This publication was created thanks to the partial support of projects
APVV-15-0508, VG 1/0646/15 and KEGA 009STU-4/2014.

Annotation:

EEG and Face Recognition: Comparison of Approaches for Emotion Detection

Information on the user's current emotion is valuable feedback that can be used for adapting
the behaviour of applications as well as for post-hoc analysis of their usability. To obtain
reliable emotion detection, we usually have to rely on specialised hardware, which has low
penetration and is intrusive. Using web cameras seems to be a compromise; the use of affordable
EEG sensors is also promising. In the paper we present two studies that aim to compare existing
approaches to user emotion detection.
Abstract. Recently, the Internet of Things (IoT), Big Data and Machine Learning have
become very hot topics in both research and commercial spheres. IoT refers to the
world of devices connected to the Internet, through which massive amounts of data are
continuously collected, concentrated and managed. Raw data can also come from other
processes such as information retrieval, web monitoring, database systems and so on.
Mining such data means analysing it in order to obtain usable results and/or
knowledge. This paper presents several considerations about large-scale data, data
processing and data mining using machine learning techniques, together with the
technological background in high performance computing (HPC), Apache Spark and GPUs
that enables and accelerates the whole process.
1 Introduction

It is clear that machine learning (ML) algorithms learn from data, and data is de facto
the heart of many solutions. The availability of high performance infrastructures,
technologies and machine learning libraries, in combination with computational and/or
data intensive strategies, opens nearly unlimited possibilities for data mining (DM).
One important point, however, is the flexibility of the solution design, which must
take into account at least the 3Vs (Volume, Velocity and Variety) of data as well as
efficiency criteria such as resources, performance, cost efficiency, etc. A universal
solution for the "Big Data" challenges still does not exist; however, coupling
strategies and technologies on a mathematical foundation, with a data-centric approach
based on real requirements, is a good starting point. In practical scenarios with big
and large-scale data, the use of incremental algorithms is visibly increasing [4] [6],
with satisfactory reported model performance in comparison with traditional in-memory
algorithms.
2 Data mining using machine learning techniques

Nowadays, global data production is continually increased by ubiquitous sensors
distributed worldwide for long-term monitoring. Mining such data means analysing it in
order to obtain usable results and/or knowledge. Currently, ML techniques in general,
and supervised learning approaches in particular, play the central role in many
practical/commercial cases. In general, ML approaches can be divided [1] into:

• Traditional in-memory learning (offline learning), where the whole training data set
can be loaded into machine memory. The main advantage of this approach lies in the
many existing algorithms and the number of available libraries, each with numerous
methods and implementation improvements for achieving precise results. The
disadvantage is the memory limitation, which restricts its use to small data sets.

• Incremental learning (online learning), which does not require the whole data set to
be loaded into machine memory at once. Instead, it loads the data in batches. These
algorithms use limited memory and limited processing time per item; therefore, the
input data set can be large-scale without memory limitation. On the other hand, the
number of available algorithms is limited in comparison with the in-memory approach.

• Distributed learning, which is typically coupled with infrastructure, i.e. a DAS
(Data Analytics Supercomputer, e.g. Apache Spark [2]). It is usually applied to very
large data sets that do not fit into the memory of one machine. A DAS is usually also
used as a whole ecosystem for data processing, data integration and data management.
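The incremental (online) learning approach from the list above can be sketched in a few lines. This is a minimal illustration, assuming a logistic-regression model updated by stochastic gradient descent; the class `OnlineLogisticRegression`, its `partial_fit` method and the synthetic generator `stream_batches` are hypothetical names and do not come from any library named in the paper.

```python
import math
import random


class OnlineLogisticRegression:
    """Binary classifier updated one mini-batch at a time with SGD."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, batch):
        """Update the model from one batch of (features, label) pairs."""
        for x, y in batch:
            error = self.predict_proba(x) - y
            self.weights = [w - self.learning_rate * error * xi
                            for w, xi in zip(self.weights, x)]
            self.bias -= self.learning_rate * error


def stream_batches(n_batches, batch_size, seed=0):
    """Hypothetical data source: yields batches instead of one big data set."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        batch = []
        for _ in range(batch_size):
            x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
            batch.append((x, 1 if x[0] + x[1] > 0 else 0))
        yield batch


model = OnlineLogisticRegression(n_features=2)
for batch in stream_batches(n_batches=50, batch_size=20):
    model.partial_fit(batch)  # only the current batch is held in memory
```

Memory use depends on the batch size and the number of features rather than on the total number of training items, which is the property the incremental approach relies on.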

The set of ready-to-use machine learning methods is extensive, and their
implementations are equally rich, available in many languages with many versions and
improvements. The most well-known ML libraries (or collections) are listed in Tab. 1:


Tab. 1. The most well-known ML libraries

Library (impl. language) | Strong points | Weak points
Weka3 (Java) | general purpose, GUI, popular | small datasets
MOA (Weka related) | data stream mining, concept drift, recommender systems | -
R, Python (and libraries) | statistics, ML, very popular | R vs. Python
RapidMiner | general purpose, DB connection, popular | -
Scikit-Learn (Python) | general purpose, popular | small datasets
NLTK (Python), Clojure | general purpose, natural language toolkit and text mining | small datasets
PyBrain (Python) | neural network, reinforcement learning, evolution, easy use | good for study and experiments
MLLib (Scala, Java) | Spark distributed scalable ML framework, growing community | coupled with infrastructure
Mahout (Java) | Hadoop ML framework | comes with Hadoop overhead
H2O.ai | massively scalable Big Data analysis, distributed processing (Hadoop, Spark) | -
Shogun (C++) | general purpose, designed for large scale learning, kernel methods, SVM, HMM | -
LIBSVM (C++), LIBLINEAR (C++) | integrated software, large-scale data | narrowed approach
Vowpal Wabbit (C++) | fast out-of-core ML system, online learning | limited number of algorithms
XGBoost | parallelized general purpose gradient boosting library | narrowed approach
MatLab, GNU Octave | scientific libraries | math oriented


One of the most widely used data mining methodologies [8] is CRISP-DM (Cross-Industry
Standard Process for Data Mining), which consists of six steps: Business
Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and
Deployment. The Data Preparation step consists of the sub-steps Data Transformation,
Exploratory Data Analysis (EDA) and Feature Engineering. The first five steps together
are also called the development phase; the Deployment step is also called the
production phase.

Although most of the interest in DM/ML goes to the Modeling step and its algorithms,
one important point remains: ML algorithms learn from data. In practice, therefore,
Data Understanding and Data Preparation can consume up to 80% of the total time of a
DM project that uses ML techniques. Data Preparation is also informally called Data
Munging or Data Wrangling, terms that hint at strenuous work. Certain problem-solving
techniques, e.g. Forward Selection and Backward Elimination in the Feature Engineering
sub-step or grid search in the Modeling step, can lead to computationally intensive
tasks, especially when the ML input data is large-scale or big. An HPC
(high-performance computing) cluster can be used to train models concurrently in order
to shorten the development time.
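Grid search with concurrent training, as mentioned above, can be sketched with the Python standard library alone. The scoring function `train_and_score` is a hypothetical stand-in (a toy score surface rather than a trained model), and the parameter names `C` and `gamma` are chosen only for illustration; a thread pool on one machine stands in for the cluster scheduler an HPC installation would provide.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_score(params):
    """Hypothetical stand-in for training one model and returning its validation score."""
    c, gamma = params["C"], params["gamma"]
    # Toy score surface that peaks at C=10, gamma=0.1; a real project would
    # fit a model here and evaluate it on held-out data.
    return -((c - 10) ** 2) / 100.0 - (gamma - 0.1) ** 2


def grid_search(grid, max_workers=4):
    """Evaluate every parameter combination concurrently and return the best one."""
    names = sorted(grid)
    candidates = [dict(zip(names, values))
                  for values in product(*(grid[name] for name in names))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(train_and_score, candidates))
    best_score, best_params = max(zip(scores, candidates), key=lambda pair: pair[0])
    return best_params, best_score


grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
best_params, best_score = grid_search(grid)
```

Each candidate is independent of the others, which is why the search parallelizes so easily. For CPU-bound training in pure Python, separate processes or cluster nodes would be needed for a real speed-up, since threads share one interpreter.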

The following parts present some practical notes on data processing and the DM process
using ML techniques, drawn from commercial and research applications with IISAS
participation in recent years.

Malicious behavior detection in the mobile device log domain. Now that everyone owns
and uses mobile devices such as smartphones and/or tablets, the demand for
cybersecurity and situational awareness is growing. This work was part of a six-month
pilot research project done for IBM Slovakia. The question was whether malicious
behavior of mobile devices can be detected from logs collected on those devices. The
raw data, logs from mobile devices, belong to the class of human-generated data, which
is not as "Big" as machine-generated data. Data mining using ML techniques in this
domain had to overcome the following obstacles:

• Collected raw logs are extremely noisy for the specific detection purpose. The logs
contain a lot of information about continuous monitoring processes such as timing
(clocks, alarms, calendars), positions, accelerators, display setting and adapting,
network and power monitoring, scanning processes, etc.;

• Low occurrence of malicious behavior (malware-related activities), which caused
imbalanced classes in the data used for supervised ML;

• Feature extraction for data with evolving characteristics, i.e. the number of
applications on mobile devices changes with users' demands without any limitation;

• Privacy-preserving data mining of sensitive personal information.

The DM process required thorough Data Understanding in collaboration with domain
experts, Data Preparation (especially EDA) and Feature Engineering. The ML technique
applied in this case was simple supervised binary classification with incremental
learning. The results obtained were highly satisfactory in distinguishing malicious
behavior from normal behavior.

Click-through-rate advertising: raw and ML data are really big in both the development
and production phases. The analysis techniques applied include reservoir
(sub)sampling, bias monitoring, smoothing, sliding windows of settable size,
forgetting mechanisms, etc., combined with adaptive online learning (retraining in
combination with incremental adaptation). ML data is highly imbalanced, as is usual in
many commercial cases, which implies boosting one class against the other by reducing
the number of negative examples. Feature selection and feature combination are also
used to improve model performance. The production infrastructure is the
high-performance Hadoop cluster of the technology company Magnetic Media Online, Inc.
(USA).
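Reservoir sampling, one of the techniques listed above, draws a fixed-size uniform sample from a stream whose length is not known in advance. The sketch below uses the classic single-pass scheme (often called Algorithm R); the click-log generator and its field names are hypothetical and only illustrate the call.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item  # replace a kept item with probability k / (i + 1)
    return reservoir


# Hypothetical click log: one generator item per impression.
impressions = ({"id": n, "clicked": n % 50 == 0} for n in range(100_000))
sample = reservoir_sample(impressions, k=1000, seed=42)
```

Only the `k` retained items are held in memory, so the sample size stays constant however long the stream runs.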

Power utility for functional awareness of monitoring stations: the raw input data in
this case is quite interesting. It is not "Big" in any of the 3Vs, but consists of
purely numerical, structured data collected from monitoring stations over several
years. Such data can be called large-scale; it causes computationally intensive tasks
with high memory consumption in the development phase. The question was whether
production could run on a single machine with limited memory, for reasons of cost and
energy efficiency. The solution can be a traditional in-memory approach on a machine
with more memory, incremental learning, or distributed learning with Spark installed
on a single machine for production. However, using incremental learning to overcome
the machine memory limitation may be the least painful way in both phases.

3 Machine learning and many-core accelerators 

In recent years accelerators have been used successfully (not only) in machine
learning and deep learning applications [4]. Manufacturers often offer the possibility
of enhancing a hardware configuration with many-core accelerators to improve
machine/cluster performance. If we look at the list of the top 500 most powerful
supercomputers, we can see an increasing trend in both the number of systems that
employ accelerators and their performance share. The most popular accelerator models
are based on the MIC (Many Integrated Cores) and GPU (Graphics Processing Unit)
architectures. Accelerators are able to offer a significant performance increase in
many application domains, e.g. the work [5] carried out in collaboration between TUKE
(Technical University of Kosice) and IISAS (Institute of Informatics, Slovak Academy
of Sciences). The main feature of many-core accelerators is their massively parallel
architecture (e.g. the new NVIDIA P100 accelerator contains 3840 CUDA cores), which
allows them to speed up computations that involve matrix-based operations, the heart
of many ML implementations. Many popular ML frameworks and libraries already offer the
possibility of using GPU accelerators to speed up the learning process, with supported
interfaces in various languages, e.g.:


Tab. 2. Popular ML frameworks and libraries

Library (impl. language) | Main purposes
Theano (Python) | math expression compiler
TensorFlow (C++, Python) | numerical computation library by data flow graphs
Keras (Python) | minimalist, highly modular neural networks library capable of running on top of TensorFlow or Theano
Caffe (C/C++, Python, MatLab, CLI) | deep learning framework for image processing
CNTK (C++, CLI) | unified deep-learning toolkit that implements CNN and RNN training for speech, image and text data
DL4J (Java, Scala) | distributed deep-learning library written for Java and Scala, integrated with Hadoop and Spark
Neon (Python) | Nervana's Python-based deep learning library
Torch (C/LuaJIT) | NN and optimization libraries that put GPUs first
MatConvNet | Convolutional Neural Networks (CNNs) for MatLab



Some of them also allow the use of the optimized CUDA Deep Neural Network (cuDNN)
library to improve performance even further. Like the ML libraries mentioned in
Section 2, ML libraries with GPU support differ in their implementation level and
target various specific purposes such as image, voice and text processing.

The demand for even more powerful hardware for deep learning applications led NVIDIA,
the main manufacturer of GPU accelerators, to invest considerably in the development
of a new architecture called Pascal and of the special-purpose system DGX-1, optimized
for many-layered DNNs. The most notable new features are half-precision arithmetic,
which allows reaching 21.2 teraflops, and a 160 GB/s bidirectional interconnect, which
significantly improves scalability in multi-GPU systems.

Matrix-based operations on Apache Spark can be computationally accelerated following
the same logic as GPU/CUDA acceleration. The logic of Apache Spark and of GPU
processing is similar (not only) from the ML viewpoint:

• If the data fits into the memory of one machine, the GPU is faster; otherwise Spark is;

• Spark logic is similar to CUDA host logic in terms of SIMD processing;

• Spark network overhead corresponds to PCI Express transfer overhead;

• mapPartitions is like a kernel launch, and partitions are like CUDA blocks;

• Model parallelism vs. data parallelism: data parallelism applies a single instruction
to multiple data items, the ideal workload for a SIMD computer architecture; model
parallelism gives every processor the same data but applies a different model to it;
the hybrid approach combines data and model parallelism.
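The difference between data parallelism and model parallelism can be made concrete with a small sketch. It uses a thread pool on one machine purely for illustration and involves neither Spark nor CUDA; the functions `partition`, `data_parallel` and `model_parallel` are hypothetical names, and the "models" are trivial summary functions.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_parts):
    """Split data into n_parts roughly equal chunks."""
    return [data[i::n_parts] for i in range(n_parts)]


def sum_of_squares(chunk):
    """The single 'instruction' applied to every partition."""
    return sum(x * x for x in chunk)


def data_parallel(data, n_workers=4):
    """Same function, different slices of the data; partial results are combined."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(sum_of_squares, partition(data, n_workers))
        return sum(partials)


def model_parallel(data, models, n_workers=4):
    """Same data, a different model (function) on every worker."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = {name: pool.submit(fn, data) for name, fn in models.items()}
        return {name: future.result() for name, future in futures.items()}


data = list(range(1, 101))
total = data_parallel(data)
summary = model_parallel(data, {"min": min, "max": max, "mean": lambda d: sum(d) / len(d)})
```

In the analogy drawn above, `partition` plays the role of Spark partitions or CUDA blocks, and the call that maps one function over all partitions plays the role of mapPartitions or a kernel launch.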
The potential benefits¹,² of using GPUs to further accelerate Spark have also been
explored, with positive results.

4 Conclusions 

This paper presents a few considerations about working with and mining large-scale
data using ML techniques in our department in recent years. We hope that these notes
are useful for readers with related research interests, and we would like to thank our
colleagues and the reviewers for their consultations and advice during the preparation
of the paper.
Abstract. The success of community question answering (CQA) systems on the open web
(e.g. Stack Overflow) has led to their application in new contexts (e.g. in education)
and in new environments (e.g. within organisations). One representative of this trend
is Askalot, an educational university CQA system developed at FIIT STU. To explore
fully the potential of CQA systems in the educational domain, we established
cooperation with researchers from Harvard University with the aim of adapting Askalot
as an extension to the MOOC system edX. At the same time we are working on deploying
Askalot at further universities in Lugano and Novi Sad. In this paper we describe the
design and implementation solutions that provided the flexibility and scalability
needed for these different environments. We also present the research opportunities
offered by deploying Askalot in the educational domain.

Paper type: Application paper

Keywords: CQA, MOOC, Askalot, knowledge sharing, student communities


1 Introduction

Since their emergence, community question answering (CQA) systems have become a
significant source of knowledge on today's web. In the most popular CQA systems, such
as Stack Overflow or Yahoo! Answers, communities of millions of users share their
knowledge by asking questions and providing answers. Recently, the success and
popularity of CQA systems on the open web has motivated both research and industry to
use them in other areas as well. First, the potential of CQA systems has been
recognised not only in the context of the web but also in education [1] and in
customer support centres [3]. Second, CQA concepts need not be used only by large open
communities; they can also serve closed groups of users within organisations (e.g. in
the social platform IBM Connect [2]). Adapting CQA concepts to these new areas,
however, brings new problems and research challenges (e.g. how to adapt the
functionality of CQA systems to the specifics of a particular environment).

In our previous work we identified a new concept of an educational and organisational
CQA system, which we then implemented and evaluated in the form of the Askalot system
[4]. In contrast to standard open CQA systems (e.g. Stack Overflow), Askalot¹ takes
into account the specifics of education (e.g. the presence of a teacher, markedly
different levels of user knowledge) and of the organisational environment (e.g.
smaller community size, users who know one another). Askalot is implemented as an
open-source web application². It is currently deployed at the Faculty of Informatics
and Information Technologies of the Slovak University of Technology in Bratislava. It
covers a community of more than 1,100 students and teachers, who have so far provided
more than 560 answers to more than 430 questions.

Building on these positive results, we have recently established cooperation with:

1. Harvard University, with the aim of using Askalot as a replacement for the standard
discussion forum in the MOOC (Massive Open Online Course) system edX.

2. The universities in Lugano and Novi Sad, within a cooperation project of the SCOPES
programme, with the aim of deploying Askalot at these universities.

However, the original design of Askalot (described in [4]) was tailored specifically to
our university and therefore did not provide the flexibility and scalability required
for these diverse educational environments. As a consequence, we had to rework its
design; the result is several design recommendations, which we present in the following
section.

2 Design of Askalot for Diverse Educational Environments

While some concepts and functions of CQA systems used in specific environments are
naturally flexible and scalable, others require several design and implementation
changes when deployed more broadly. We divided the changes made to Askalot into four
groups.

Modular architecture. Following the requirements and specifics of the different
environments, we identified two main configurations of Askalot, code-named
Askalot @university and Askalot @mooc. We then created three modules. Into the first we
extracted the functionality common to all environments (e.g. posting questions and
answers, or lists of users, categories, etc.). The other two modules inherit all
functions from this primary module and add functions specific to the university and
MOOC environments, respectively.
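
As an illustration only, the three-module layout described above can be sketched in Python. The class and method names (CoreModule, UniversityModule, MoocModule, features) are hypothetical, and the environment-specific features attached to each module are assumptions made for the example; the sketch does not describe how the actual Askalot code base is organized.

```python
class CoreModule:
    """Functionality shared by all environments (first module)."""

    def features(self):
        return ["questions", "answers", "user list", "category list"]


class UniversityModule(CoreModule):
    """Askalot @university: inherits the core and adds its own features."""

    def features(self):
        # Assumption: LDAP login is treated here as university-specific.
        return super().features() + ["LDAP login"]


class MoocModule(CoreModule):
    """Askalot @mooc: inherits the core and adds its own features."""

    def features(self):
        # Assumption: LTI launch is treated here as MOOC-specific.
        return super().features() + ["LTI launch"]


for module in (UniversityModule(), MoocModule()):
    print(type(module).__name__, module.features())
```

Keeping the shared behaviour in one base module means that a change to, say, question posting is made once and reaches both configurations.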

Flexible integration of user management. Askalot provides several ways of registering
and authenticating users, both automatic and manual.


1 A demo of Askalot is available at https://askalot.fiit.stuba.sk/demo

2 The source code of Askalot is available at https://github.com/AskalotCQA/askalot
First of all, it is possible to use LDAP authentication, which is available at many
universities. Askalot also supports the LTI (Learning Tools Interoperability) protocol,
which was designed specifically for exchanging information between educational systems
(including information about the students themselves). The last option is standalone
user accounts created directly in Askalot. With this authentication method, Askalot can
be configured so that individual accounts are fully anonymous, which is important
mainly in cases where students refuse to ask questions as long as their identity is
public.
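
The three sources of user accounts listed above can be pictured as adapters that all produce the same user record. The following Python sketch is only a hedged illustration: the User type, the function names and the pseudonym scheme for anonymous accounts are hypothetical, and the LDAP entry is a plain dictionary rather than a real directory lookup. The keys user_id and lis_person_name_full are launch parameters defined by LTI 1.x.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class User:
    login: str
    display_name: str


def from_ldap(entry: dict) -> User:
    # Hypothetical LDAP entry reduced to its 'uid' and 'cn' attributes.
    return User(login=entry["uid"], display_name=entry["cn"])


def from_lti(params: dict) -> User:
    # 'user_id' and 'lis_person_name_full' come from the LTI 1.x launch request.
    return User(login=params["user_id"], display_name=params["lis_person_name_full"])


def from_local(login: str, anonymous: bool) -> User:
    # Assumption: an anonymous account is shown under a pseudonym derived from the login.
    if anonymous:
        digest = hashlib.sha256(login.encode("utf-8")).hexdigest()[:8]
        return User(login=login, display_name=f"anonymous-{digest}")
    return User(login=login, display_name=login)


print(from_ldap({"uid": "xnovak", "cn": "Jana Novakova"}))
print(from_local("xnovak", anonymous=True).display_name)
```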

Adaptive and self-maintaining content organization. Askalot provides a two-level
content structure. At the first level, the asker can place a question into a hierarchy
of categories (these reflect the formal structure of education, e.g. courses or
sections of an online course). At the second level, the topic of the question can be
refined using tags. The category hierarchy in Askalot is designed to take into account
the regular repetition of categories (across academic years or re-runs of MOOC
courses). It is then possible to choose, for example, in which categories content from
previous iterations of the same course should also be displayed.
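
A minimal sketch of how repeated categories could expose content from earlier iterations is given below. It is an illustration under stated assumptions, not the Askalot data model: the Category fields, the show_previous flag and the use of string comparison to order iterations are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    name: str            # e.g. a course name
    iteration: str       # e.g. academic year or MOOC run (assumed to sort as text)
    show_previous: bool  # also show content from earlier iterations?
    questions: list = field(default_factory=list)


def visible_questions(current: Category, history: list) -> list:
    """Return the questions shown in `current`, optionally with older runs."""
    shown = list(current.questions)
    if current.show_previous:
        for older in history:
            if older.name == current.name and older.iteration < current.iteration:
                shown.extend(older.questions)
    return shown


old_run = Category("Databases", "2015/16", False, ["How do joins work?"])
new_run = Category("Databases", "2016/17", True, ["What is an index?"])
print(visible_questions(new_run, [old_run]))
```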

Widely available overview of activity and notifications. Last but not least, Askalot
provides several ways of informing students and their teachers about activity. These
are primarily notifications displayed directly in the system, but also a summary email
with the activity of the last 24 hours. In addition, Askalot can be connected to the
social network Facebook, in which case notifications are displayed directly in that
service.
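
The 24-hour summary email mentioned above amounts to selecting the events that fall into a fixed time window. The sketch below shows this selection only; the function name and the representation of events as (timestamp, text) pairs are assumptions made for the example.

```python
from datetime import datetime, timedelta


def daily_digest(events, now):
    """Select the texts of events from the 24 hours before `now`."""
    since = now - timedelta(hours=24)
    return [text for when, text in events if since <= when <= now]


now = datetime(2016, 11, 3, 8, 0)
events = [
    (datetime(2016, 11, 1, 9, 0), "old question"),
    (datetime(2016, 11, 2, 12, 0), "new answer"),
    (datetime(2016, 11, 3, 7, 30), "new comment"),
]
print(daily_digest(events, now))
```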

3 Conclusion and Future Work

By reworking the design and implementation of Askalot, we have shown several design
recommendations for how CQA systems can be applied not only on the open web but also in
a specific domain and environment while achieving a high level of flexibility and
scalability. Ultimately, this allows us to deploy Askalot at several universities as
well as in the MOOC system edX, where Askalot has been used since the beginning of
September 2016 as part of the course QuCryptox Quantum Cryptography 3 with more than
5200 enrolled students in total.

Such a deployment offers significant research potential. So far, we have focused in
Askalot mainly on adapting the basic functions and concepts of CQA systems. We are
aware, however, that methods for supporting collaboration during the question answering
process are also affected by the specifics of the educational and organizational
environment. While these methods receive plenty of attention in standard open CQA
systems, in domain-specific CQA systems we currently see only the first research
contributions, e.g. on recommending new questions to users who are suitable candidates
for providing an answer (question routing) [7].

Moreover, domain-specific CQA systems offer an opportunity to verify research methods
in live experiments. So far, such live experiments have been very rare in CQA research.
In our previous comprehensive survey of approaches to supporting collaboration in CQA
systems [6], we found that only 3 of 169 approaches had been verified in live
experiments. Askalot additionally provides an experimental infrastructure [5] that
makes it easy to connect synthetic experiments on datasets (from Askalot as well as
from CQA systems built on the Stack Exchange platform) with online experiments in the
system itself.

3 https://courses.edx.org/courses/course-v1:CaltechDelftX+QuCryptox+3T2016/info

Acknowledgement: This publication was partially supported by the project KEGA
009STU-4/2014 and is a partial result of cooperation within the SCOPES JRP/IP project,
No. 160480/2015.
Supporting knowledge sharing in educational courses by means of CQA system Askalot

The success of Community Question Answering (CQA) systems on the open web (e.g. Yahoo!
Answers) has motivated their use in new contexts (e.g. education or enterprise) and
environments (e.g. inside organizations). Despite initial research on how the specifics
of these settings influence the design of CQA systems, many problems have not been
addressed so far, in particular the poor flexibility and scalability that hamper the
use of essential CQA features in various settings (e.g. in different educational
organizations). In this paper, we provide design recommendations on how to achieve
flexible and scalable deployment by means of a case study of the educational and
organizational CQA system Askalot. Its universal and configurable features allow us to
deploy it at several universities as well as in the MOOC system edX.
Abstract. This paper describes the problem of creating attendance lists at the
Technical University of Košice. It also outlines the options for optimizing this
process that led to the final solution. The solution is implemented as a mobile and a
web application with a central database and a REST API that enables communication
between the components of the application, specifically between the central database
and the mobile application. The ISIC (International Student Identity Card), which every
student of the Technical University must hold, is in this case a suitable means of
unambiguously determining a student's presence using a mobile device with NFC
technology. The contribution of the solution is an optimized process of creating
attendance lists and assurance of the correctness and completeness of the data. The
achieved results can be built upon in the future by further extending the application.

Contribution type: Application paper
Keywords: attendance, application, NFC


1 Introduction

At the Technical University of Košice, as at other colleges and universities, recording
student attendance at exercises, and sometimes also at lectures, is a condition for
granting course credit or the exam. In most cases attendance lists are kept on paper,
which means that errors can occur while they are being filled in, or the document can
easily be lost. The main motivation for the application described in this paper is to
simplify and digitize the process of creating these attendance lists. From a
technological point of view, it is possible to use student RFID cards, known as ISIC
(International Student Identity Card), and, on the application side, the NFC (Near
Field Communication) technology 1 2 3 .

We decided to create an environment, or rather an application, that covers both the
client side (scanning cards and viewing student attendance) and the administration part
(managing lists of students, exercises and lectures). The integrating element of the
whole solution is a database, which gives both the mobile clients and the web
administrator access to the data through REST API services. The technologies used are
also worth mentioning. The mobile application was created in Android Studio in Java 4 ,
while the website and the REST API were created in Visual Studio 2013 in C# using the
ASP.NET MVC 4 framework 5 6 .

2 Problem Analysis

As mentioned in the introduction, attendance lists are created either by students
signing a sheet or by the teacher checking attendance. Both approaches have several
shortcomings, which we briefly mentioned in the introduction. The most advantageous
solution appears to be using the teacher's smartphone and synchronizing the data with
an external database. In this way we can guarantee that the data are up to date at all
times and that the attendance lists depend primarily on the students' ISIC cards. We
can therefore conclude that the current ways of creating attendance lists are not ideal
in terms of persistence, timeliness and availability. The application we created
improves the availability of attendance lists, because they are stored both centrally
in the database and in the local database of the mobile device. The data on both sides
are kept up to date by synchronization while the application is online. It is essential
to implement a function for manually recording a student without NFC as a fallback for
cases where the student forgets the card, the card is damaged, or card scanning cannot
be used for other reasons.


1 Prelovsky, Lukas et al.: Co je to vlastne NFC a ake ma vyuzitie. Online:
<http://www.lukasprelovsky.sk/co-je-to-vlastne-nfc-a-ake-ma-vyuzitie/>

2 NFC. Online: <http://www.nearfieldcommunication.org> 6.4.2016

3 Moj android: Zaciname s NFC: Co je NFC a ako funguje. Online:
<https://www.mojandroid.sk/nfc-co-to-je-ako-funguje/> 6.4.2016

4 Android Developers: Android, the world’s most popular mobile platform. Online:
<http://developer.android.com/about/android.html> 2.4.2016 (in English)

5 MVC architektura. Online:
<http://www.itnetwork.cz/navrhove-vzory/mvc-architektura-navrhovy-vzor> 5.4.2016

6 Coplien, James O., Reenskaug, Trygve: The DCI Architecture: A New Vision of
Object-Oriented Programming, 20.3.2009. Online:
<http://www.artima.com/articles/dci_vision.html> 28.6.2016 (in English)
3 Application Implementation

First, it was necessary to design the conceptual architecture of the solution (Fig. 1)
and thus obtain a simple view of the problem. Based on this design, we gradually built
the application environment for the individual uses of the application. First the
database had to be designed and created, as it is the unifying element of the
individual applications. In its initial design the database had four tables, which held
basic information about exercises, lectures, students, and the assignment of students
to lectures and exercises.



Fig. 1 Architecture of the solution

After the final version of the database was completed, a web application was created
for managing students, lectures and exercises and for assigning students to exercises
and lectures. Based on testing, the data model was fine-tuned and the overall concept
of the application was refined.

For the web application we chose the "long scroll" form, as it is currently a commonly
used design trend. The page contains several sections showing student attendance at
lectures and exercises as well as the assignment of individual exercises to students. A
student's attendance at lectures and exercises is changed using a state button
(present, absent, no record).
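
The three-state button described above can be modelled as a small cycle of states. The sketch below is written in Python for illustration only (the application itself uses jQuery on the web and Java on Android); the order of the states follows the text, while the wrap-around from "no record" back to "present" is an assumption.

```python
STATES = ["present", "absent", "no record"]


def next_state(current: str) -> str:
    """Return the state shown after one click on the state button."""
    # Assumption: after the last state the button wraps around to the first.
    return STATES[(STATES.index(current) + 1) % len(STATES)]


state = "present"
for _ in range(3):
    state = next_state(state)
    print(state)
```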

Technically, the interactivity of the web application is provided by jQuery and Ajax
7 8 functions, for example for displaying alerts or information messages when data are
saved and written to the database, when switching between the student lists of
individual exercises, and so on. Students are assigned to exercises by selecting radio
buttons. In this section it is also possible to add new students and new exercises.


7 Taft, Darryl K.: jQuery Eases JavaScript, AJAX Development, 30.8.2006. Online:
<http://www.eweek.com/c/a/Application-Development/jQuery-Eases-JavaScript-AJAX-Development>
28.6.2016 (in English)

8 MDN Mozilla Developer Network: Ajax. Online:
<https://developer.mozilla.org/en-US/docs/AJAX> 5.4.2016 (in English)
The REST API functions are implemented as a separate controller of the web application.
They provide communication between the mobile application and the database. Requests as
well as responses are sent in JSON format 9 .
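
To make the JSON exchange concrete, the following Python sketch shows what one request and response pair might look like. It is a hypothetical illustration: the paper does not list the API endpoints or message fields, so the field names (card_uid, exercise_id, week, ok) and the handler are assumptions, and the real controller is written in C#.

```python
import json

# Hypothetical request body sent by the mobile application.
request_body = json.dumps({"card_uid": "04A1B2C3", "exercise_id": 7, "week": 3})


def handle_attendance(body: str) -> str:
    """Parse a JSON request and return a JSON response (illustrative only)."""
    data = json.loads(body)
    missing = [key for key in ("card_uid", "exercise_id", "week") if key not in data]
    if missing:
        return json.dumps({"ok": False, "error": f"missing fields: {missing}"})
    return json.dumps({"ok": True, "recorded": data})


print(handle_attendance(request_body))
print(handle_attendance("{}"))
```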

The mobile application consists of 4 screens (layouts). They are placed in a
FrameLayout and their structure is similar to that of the web application. We left out
the assignment of students to exercises, because it would be unsatisfactory on a mobile
device from a user experience (UX) point of view.

The first layout (Home) contains information about the average attendance of students
at lectures and exercises. The user can display this information as a line chart
following the course of the semester. Further layouts are Lectures and Exercises, which
show student attendance during the semester. The list of students is built from a
hierarchy of RelativeLayout, LinearLayout and ImageView objects. The hierarchy is shown
in the following figure.



Fig. 2 Structure of the information about a student

Attendance states are changed in the same way as in the web application, by tapping the
attendance state buttons. A change of state for the selected week is shown below the
student's name as well as in the right part of the screen. Students can also be
recorded using NFC: the ISIC card is held to the smartphone's NFC reader and the
student is then recorded as present. A successful record is indicated by a message (a
toast) at the bottom of the screen containing the student's name and the UID of the
ISIC card.
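
The NFC recording step described above reduces to looking up the card UID and marking the matching student as present. The Python sketch below illustrates this logic only and does not use the Android NFC API; the function name, the dictionaries and the wording of the messages are hypothetical.

```python
def record_by_card(uid, students, attendance, week):
    """Mark the student owning card `uid` as present and return the toast text."""
    name = students.get(uid)
    if name is None:
        # Unknown card: the manual entry mentioned in Section 2 is the fallback.
        return f"Unknown card {uid}; use manual entry"
    attendance[(uid, week)] = "present"
    return f"{name} ({uid}) recorded"


students = {"04A1B2C3": "Jana Novakova"}  # hypothetical card UID and name
attendance = {}
print(record_by_card("04A1B2C3", students, attendance, week=3))
print(record_by_card("FFFFFFFF", students, attendance, week=3))
```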

4 Conclusion

This paper describes the design and implementation of an application for creating and
managing attendance lists. On the mobile client side, NFC technology was used to scan
the students' RFID cards, and in the background a web application and a database were
created to maintain the data collected in this way. We expect the application to be
used in practice as early as the next semester, once it has been tested and its bugs
have been fixed. Testing will be part of this use. We plan to introduce this method of
attendance checking in at least one exercise class; before the end of the semester, the
efficiency and correctness of the solution will be compared with the classical approach
by means of a questionnaire for both students and teachers.

9 JSON: The Fat-Free Alternative to XML. Online: <http://www.json.org/xml.html>
28.6.2016 (in English)

Creating this application was the goal of a diploma thesis.

Acknowledgement: This publication was supported by the Research and Development
Operational Programme for the project "Centre of Information and Communication
Technologies for Knowledge Systems" (ITMS code: 26220120020), co-financed by the
European Regional Development Fund (50%). The publication was also supported by the
KEGA project No. 014TUKE-4/2015 "Digitalization, virtualization and testing of a small
jet engine using test stands for the needs of modern applied education" (50%).
Annotation:

Design and implementation of an application for creating presence lists using
a smartphone with NFC

This paper presents the problem of creating presence (attendance) lists at the
Technical University of Košice. It also describes the possibilities of optimizing this
process, which lead to the final solution. The solution is realized in the form of a
mobile and a web application with a central database and a REST API for communication
between the components of the application, particularly between the central database
and the mobile application. The ISIC card, which every student of the Technical
University must possess, is in this case a suitable medium for unambiguous
identification of a student's presence using a smartphone with NFC technology. The
benefit of this solution is an optimized process of creating presence lists and
improved correctness and completeness of the data. The application can be further
extended in the future.
Abstract. The EasyMiner web application is an academic tool for knowledge discovery
from small and medium-sized data in the form of association rules. The new version of
the system uses Apache Hadoop and Apache Spark to process large data sources on the
computing cluster of the MetaCentrum of the CESNET association. The application
consists of several microservices that handle uploading big data into the distributed
HDFS storage, transforming the data in the cluster into a normalized form, and mining
knowledge from datasets in the form of association rules using the computing resources
of the cluster via Apache Spark. These microservices can be accessed through a REST
interface and together form data mining software operating as a web service (SaaS).

Contribution type: Application paper

Keywords: data mining, association rule mining, big data, hadoop


1 Introduction

The academic tool EasyMiner 1 is a web service focused on mining association rules from
databases [2]. The application provides a graphical user interface and can perform all
operations needed for knowledge discovery from data, from uploading a dataset through
preprocessing to the mining itself and the interpretation of results. The new version
of the tool can also process big data thanks to its deployment in the Apache Hadoop and
Apache Spark environment, and it can be used for academic purposes entirely free of
charge on the computing cluster hosted by the MetaCentrum 2 of the CESNET association
(up to 24 nodes x 16 cores x 2 hyperthreading, with more than 3 TB of RAM).

1 http://www.easyminer.eu/

2 https://wiki.metacentrum.cz/wiki/Hadoop

The most important operations that can be performed in EasyMiner include:

- Streaming upload of data sources into the data store

- Preprocessing of data for faster knowledge mining

- Mining of association rules according to user requirements

- Building classification models from the discovered rules

- Manipulation of the set of discovered rules

In the context of cloud services, the tool can thus be classified as MLaaS (Machine
Learning as a Service), and it can also be used as a more rule-oriented alternative to
commercial products such as BigML.com or Microsoft Azure ML.

The application as such consists of several microservices that communicate with one
another through REST APIs (see Fig. 1). Most of these microservices work in two
different modes, limited and unlimited. The limited mode is intended for managing small
and medium-sized datasets; it uses a MySQL database as the primary data store and the R
environment for mining. In the unlimited mode the services communicate mainly with
Apache Hadoop, which is used chiefly for storing big data by means of Apache Hive and
for distributed association rule mining built on the Apache Spark framework.



Fig. 1 Architecture of the EasyMiner system


2 Big Data Support

The primary backend solution is built on the arules library from the R environment
[3, 4]. Unfortunately, this solution scales poorly and is not suitable for big data;
therefore, an additional backend layer was implemented that specializes exclusively in
mining rules from big data.

The Apache Hive interface is used for storing and preprocessing data. The mining itself
runs as a Spark job. All distributed jobs are managed by YARN. This architecture is
based on batch execution of the individual distributed operations, so it is not well
suited to smaller datasets, which can be processed much faster on a single machine
thanks to in-memory, real-time access. The strengths and weaknesses of the individual
backend solutions can be seen in the comparison in Table 1.


Tab. 1 Comparison of the two backend layers in the EasyMiner system

Property                             Limited backend                      Unlimited backend
Environment                          R                                    Apache Hadoop, Apache Spark
Storage                              MySQL                                HDFS + Apache Hive
Form of stored data                  row-oriented tables in an RDBMS      column-oriented tables in HDFS
Task execution time                  seconds                              tens of seconds to tens of minutes
Data size                            up to 100 MB                         more than hundreds of MB
Task scalability                     no                                   yes (number of nodes x number of cores)
Parallel tasks                       yes (depends on number of threads)   yes (depends on the YARN scheduler)
Mining approach                      in-memory                            distributed in-memory
Association rule mining algorithm    apriori (arules library)             FP-growth (MLlib library)
Classification model algorithm       CBA                                  CBA
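
The data-size row of the comparison suggests a simple rule of thumb for choosing a backend. The sketch below encodes it in Python as an illustration only: the function name is hypothetical, and treating 100 MB as a hard cut-off is an assumption, since the table gives only approximate ranges and the paper does not say whether EasyMiner chooses the mode automatically.

```python
# "up to 100 MB" is taken from Table 1; using it as an exact threshold is an assumption.
LIMIT_BYTES = 100 * 1024 * 1024


def choose_backend(dataset_bytes: int) -> str:
    """Suggest the limited (R + MySQL) or unlimited (Hadoop + Spark) backend."""
    return "limited" if dataset_bytes <= LIMIT_BYTES else "unlimited"


print(choose_backend(5 * 1024 * 1024))     # small dataset
print(choose_backend(2 * 1024 ** 3))       # 2 GB dataset
```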


3 Data Processing and Mining

Data can be uploaded to the appropriate store through the data service, which supports
streaming storage of data either into MySQL for the limited mode or into HDFS for the
unlimited mode. In the Hadoop environment the size of the uploaded dataset is not
limited in any way. For easier and faster processing, data in HDFS are stored in a
column-oriented form [1]. Thanks to this representation, aggregation and join functions
can be run across all columns in a single MapReduce job (see the comparison in Tab. 2).

During preprocessing, all values are mapped to numeric indices, which slightly
compresses the data and lets the mining algorithms work with a single data type without
having to deal with text encoding.
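
The mapping of values to numeric indices can be sketched as a dictionary encoding. The Python example below is a single-machine illustration under stated assumptions: the function name is hypothetical, and assigning one index per distinct (column, value) pair is only one plausible scheme; the paper does not specify how EasyMiner numbers the values.

```python
def index_values(rows):
    """Replace every distinct (column, value) pair with an integer index."""
    index = {}
    encoded = []
    for row in rows:
        encoded_row = []
        for column, value in row.items():
            key = (column, value)
            if key not in index:
                index[key] = len(index)  # next free index
            encoded_row.append(index[key])
        encoded.append(encoded_row)
    return encoded, index


rows = [
    {"city": "Praha", "sex": "F"},
    {"city": "Brno", "sex": "F"},
]
encoded, index = index_values(rows)
print(encoded)
print(index)
```

After this step every transaction is a list of integers, so the mining code never has to compare or decode strings.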

Association rule mining on smaller data is performed primarily in the R environment.
This solution is very fast, but it requires the complete transaction database to be
held in memory as input. In the distributed environment, association rules are mined
using the Spark MLlib library, specifically the FP-growth algorithm [5].
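
Both apriori and FP-growth search for itemsets and rules that pass thresholds on support and confidence. The following Python sketch shows only how these two measures are computed on a toy transaction database; it does not implement either algorithm, and the function names and example data are made up for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
print(support({"bread", "butter"}, transactions))
print(confidence({"bread"}, {"butter"}, transactions))
```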

Building a classification model from the discovered rules is part of the Spark mining
program, which implements the CBA algorithm [6]. This algorithm prunes the discovered
rules and appends a so-called default rule after them. A classifier for the chosen
target variable can then be built from this output.
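
The pruning and default-rule idea can be illustrated with a simplified, single-machine Python sketch. It is not the EasyMiner Spark implementation and it omits the final step of the published CBA procedure that cuts the rule list at the point of lowest total error; the function names and the tuple layout of rules are assumptions. Rules are sorted by confidence and support, a rule is kept only if it correctly classifies at least one training case that is still uncovered, and the default rule predicts the majority class of the cases left over.

```python
from collections import Counter


def build_classifier(rules, cases):
    """rules: list of (antecedent_set, label, confidence, support).
    cases: list of (item_set, label).
    Returns (kept_rules, default_label)."""
    ordered = sorted(rules, key=lambda r: (-r[2], -r[3]))
    remaining = list(cases)
    kept = []
    for antecedent, label, _conf, _supp in ordered:
        covered = [c for c in remaining if antecedent <= c[0]]
        if any(c[1] == label for c in covered):
            kept.append((antecedent, label))
            remaining = [c for c in remaining if not antecedent <= c[0]]
    # Default rule: majority class of uncovered cases (of all cases if none remain).
    pool = remaining if remaining else cases
    default = Counter(lbl for _, lbl in pool).most_common(1)[0][0]
    return kept, default


def classify(items, kept, default):
    """Apply the first matching rule, otherwise the default rule."""
    for antecedent, label in kept:
        if antecedent <= items:
            return label
    return default


rules = [({"a"}, "yes", 0.9, 0.5), ({"b"}, "no", 0.8, 0.4), ({"c"}, "yes", 0.7, 0.1)]
cases = [({"a"}, "yes"), ({"a", "b"}, "yes"), ({"b"}, "no"), ({"d"}, "no")]
kept, default = build_classifier(rules, cases)
print(kept, default)
print(classify({"a", "b"}, kept, default), classify({"z"}, kept, default))
```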
Tab. 2 Comparison of row-oriented and column-oriented Hive tables as used in the
EasyMiner system. The variable N denotes the number of columns in the table.

Operation                          Number of MapReduce jobs
                                   Row-oriented table    Column-oriented table
Storing a data source              2N + 1                2
Reading an aggregated histogram    1                     1
Creating a dataset                 1                     0
Preprocessing columns              N/5 + 3               3
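
To see how the job counts in Tab. 2 grow with the number of columns, the formulas can be evaluated for a concrete N. The Python sketch below is only a worked example: the function names are hypothetical, and rounding N/5 up to a whole number of jobs is an assumption, because the table does not state how fractional values are handled.

```python
import math


def jobs_row_oriented(n_columns: int) -> dict:
    return {
        "store data source": 2 * n_columns + 1,
        "read aggregated histogram": 1,
        "create dataset": 1,
        # Assumption: N/5 is rounded up to a whole number of jobs.
        "preprocess columns": math.ceil(n_columns / 5) + 3,
    }


def jobs_column_oriented(n_columns: int) -> dict:
    # According to Tab. 2 these counts do not depend on the number of columns.
    return {
        "store data source": 2,
        "read aggregated histogram": 1,
        "create dataset": 0,
        "preprocess columns": 3,
    }


n = 20
print(jobs_row_oriented(n))
print(jobs_column_oriented(n))
```

For a table with 20 columns, storing a data source takes 41 jobs in the row-oriented layout against 2 in the column-oriented one.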


4 Conclusion

Although the cloud version of EasyMiner is still under development, it can be used for
testing and academic purposes without any restrictions on the MetaCloud of the CESNET
association. The application can reliably upload data and search them for association
rules, from which classification models can be built. Future development currently
focuses on speeding up the mining algorithms, implementing discretization algorithms
and deploying a tool for anomaly detection. Support for mining rules from RDF is also
planned.

Acknowledgement: This work was supported by the University of Economics, Prague under
grant IGA 29/2016 and by the CESNET Development Fund under grant No. 540/2014.
Annotation:

Association Rules Mining for Big Data in Cloud

EasyMiner is a web service for association rule mining. A new version of this tool uses
Apache Hadoop and Apache Spark for big data processing in the MetaCloud of the CESNET
association. The application consists of several services for uploading datasets into
HDFS, preprocessing, association rule discovery and classification based on
associations. All services communicate with each other through REST APIs and together
form a complex software system working as a service in the cloud.
Abstract. In recent years, human-ontology interaction has become an increasingly
important subject for developers of computational and information systems. Human
information consumers and web agents need to use and query ontologies through their
web sites and web applications, so there is a need for tools that support ontology
engineering and querying. In this paper we discuss the potential of web-based
human-readable ontology queries. In large taxonomies such as the Aviation Safety (AS)
domain, it is important to allow easier navigation within the ontology. We therefore
introduce an extension to the OntoQuery tool for more practical visualization of query
results and easy dissemination of OWL vocabularies to the community.

Contribution type: PhD Symposium 

Keywords: ontology, query, aviation safety 


1 Introduction 

In recent years, ontologies have been applied in a large number of areas of computer
science. The term is also used to refer to specific subject domains (e.g., medicine,
biology, aviation safety), resulting in domain ontologies. During ontology design and
exploitation, users need to explore the ontology structure, and web-based tools are
well suited to this. Because ontologies have a complex structure, attention must be
paid to the development of human-ontology interaction and to improving ontology
querying tools.

In this paper we discuss how applying our OntoQuery extension to a specific domain
ontology (the aviation safety domain) could help the domain’s users and experts to
explain, evaluate and exploit vocabularies in this domain.

Section 2 presents the motivation and related work. Section 3 discusses our extension
to the OntoQuery tool, and Subsection 3.1 presents the use case used in our research.
Finally, Section 4 concludes this work.
2 Motivation and Related Work

There are several different techniques and tools supporting user interaction with
ontologies. These tools help users and ontology experts to query and explore
vocabularies and concepts. However, most ontology engineering tools suffer from various
problems with respect to interaction with human users, as discussed next.

Protege-OWL [5] is a knowledge-based ontology editor providing a graphical user
interface. It provides flexibility for meta-modeling and enables the construction of
domain ontologies. However, some studies found that the visualization options offered
by Protege are too complex. Also, many users have difficulties with description logic
(even though Manchester syntax is used).

OntoQuery [2] is a web-based query utility. Its interface provides syntax highlighting
similar to that provided by the Protege DL query tool. However, unlike Protege,
OntoQuery highlighting distinguishes between classes and properties. As the user types,
the system pops up a box with suggestions appropriate to the syntactic position within
the query. Queries return all descendants (not just direct subclasses) matching the
logical definition expressed in Manchester syntax.

OWLGrEd [4], the OWLGrEd Ontology Visualizer, is an online tool for visualizing OWL
ontologies using a compact UML-based notation.

In our previous work, we developed an aviation safety vocabulary explorer [9], which
aims to visualize and explore the concepts (classes and relations) of the aviation
safety domain. It also helps users of aviation safety websites to understand the
aviation safety vocabularies clearly, in order to make their safety reports more
efficient. We realized the importance of making navigation within large taxonomies
easier for the user. We therefore added an extension to the OntoQuery tool that divides
the aviation safety domain into categories according to its most general concepts, in
order to facilitate navigation within the aviation domain for the domain’s users and
experts.

3 Extension to OntoQuery Tool 

Our plugin extension to the OntoQuery tool [2] selects intentional classes from our
domains and corresponding conceptual models (presented in Section 3.1) using the
Protege DL query tool and names them as categories. It categorizes our ontology by
adding an isSubcategoryOf annotation property to each concept according to its
category. When the user types a query into the client-side JavaScript input box, the
query is sent to the server, which checks its syntax, translates labels to IDs and
parses the query into OWL Manchester syntax. The query is then executed, and the
resulting ontology terms are grouped according to the categories we selected, by
querying the annotation property. This extension is beneficial only for large
taxonomies (e.g., the Aviation Safety ontology), where a simple domain-specific
categorization of ontology terms allows easier navigation within the ontology.
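
The final grouping step can be sketched in a few lines of Python. This is an illustration only, not the OntoQuery plugin code: the annotations are held in a plain dictionary instead of being read from an OWL file, and apart from the pairing of Airborne Object with the Physical Object category, which is taken from this paper, the concept-to-category assignments are assumptions.

```python
from collections import defaultdict

# Hypothetical isSubcategoryOf annotations: concept label -> category label.
IS_SUBCATEGORY_OF = {
    "Airborne Object": "Physical Object",  # example given in this paper
    "Aircraft": "Physical Object",         # assumed for illustration
    "Collision": "Event",                  # assumed for illustration
}


def group_by_category(results):
    """Group query results by the category stored in the annotation."""
    groups = defaultdict(list)
    for concept in results:
        groups[IS_SUBCATEGORY_OF.get(concept, "Uncategorized")].append(concept)
    return dict(groups)


print(group_by_category(["Airborne Object", "Collision", "Runway"]))
```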

It is important and helpful for aviation safety agents to get details and a good
explanation of their input when searching online safety websites. As a concrete example
of our work (see Figure 1), when a user searches for vocabulary terms (e.g., Airborne
Object) in the aviation vocabulary explorer interface, it is beneficial not only for
users but also for aviation safety systems and experts to give the user more
information and explanation about the input by using ontology categorizations (e.g.,
Airborne Object is related to the Physical Object category, which is easy for a human
to understand).

[Figure: Aviation Vocabulary Explorer screenshot for the query "Collision and has
participant some Airborne object", listing the results Airborne object and Collision
together with a Categories panel]

Fig. 1. Aviation safety explorer categorization


3.1 Use Case

In this paper we consider the following domains and corresponding conceptual models: 

• A conceptual model representing the domain of aviation safety (1737 classes). It
defines general, well-understood concepts of the aviation domain such as Aircraft,
Flight and Agents [10].

• A conceptual model that describes the ECCAIRS taxonomies ontology (4067 classes). It
aims to improve air safety by bringing together the knowledge derived from the
collection of incompatible occurrence reporting systems from various (member) States
[8].

• The Unified Foundational Ontology (UFO), a top-level ontology for the specification
of domain ontologies and languages. It is divided into three layers: the object and
trope model part (UFO-A) [1], the event model part (UFO-B) [6] and the service model
part (UFO-C) [7].

We select our categories with respect to the most general and understandable taxonomies in
the models mentioned above; we selected the twenty (20) most general concepts as
categories. For example, the physical object and data categories relate to the aviation
safety ontology, whereas the event and trope categories relate to UFO concepts.

4 Conclusion 

In this paper we discussed how ontology-based querying (implemented by extending the
OntoQuery tool) can help domain users and experts navigate an ontology easily, especially
in very large taxonomies (e.g., aviation safety).

Abstract. Molecular biology is a domain endowed with a large amount of data and
well-formalized knowledge. Based on measured data and domain knowledge, an
intelligent integrative analysis can extract new and more specific knowledge,
which may help in understanding, e.g., disease mechanisms. We have proposed
miXGENE, a web service for integrating and analyzing high-throughput omics data,
namely microarray-based expression or methylation measurements, together with
formal biological knowledge such as gene ontologies and curated or predicted
omics interactions. The tool enables building the most commonly used analytical
workflows for processing user data or data from public databases. Data processing
is followed by integrative statistical or machine-learning-based analysis and
completed with a presentation of the results in expert-comprehensible terms. We
propose an innovation of the tool that takes advantage of the infrastructure of
the Czech National Grid, CESNET - MetaCentrum, to which the most computationally
demanding sections are offloaded.


Contribution type: PhD Symposium 

Keywords: omics data, web service, machine learning 


1 Introduction 

One of the key paradigms in data-mining research and practice is the integration of formal
knowledge about the investigated domain [1] into the analytical process. Incorporating
this knowledge is expected to help in building more precise models, interpreting the
models and discovering new knowledge. In other words, when we have a loose notion of what
we are searching for, we can adjust (bias) our algorithm towards this particular
knowledge. The resulting model is then more specific to the researched domain, and thus
interpretable in predefined terms and potentially capable of unveiling new, yet more
specific knowledge related to the researched problem. Last but not least, more accurate
models are expected from this paradigm, as the knowledge restricts the space of all
conceivable hypotheses, which prevents overfitting and improves generalization.

One of the domains with a lot of generated data and well-formalized domain knowledge is
molecular biology. Thanks to current technological progress in microarrays and
next-generation sequencing, we are being flooded with high-throughput measurements
related to the genome, transcriptome and proteome. A neologism describing the
measurements from all these biological sources is omics data. The genome is an
individual's complete genetic equipment encoded in its DNA. The amount of gene
transcript, which is further translated into protein, reflects gene expression (GE), the
fundamental process of transferring the biological information from DNA to the visible
traits called the phenotype. The knowledge linking all these processes and components
takes the form of predicted or validated omics interactions, gene ontologies, or curated
canonical pathways and gene sets. By intelligently integrating these data types and
incorporating the related knowledge, we can extract valuable nuggets that may help in
understanding, e.g., disease mechanisms or in prescribing a personalized treatment.

We have proposed miXGENE [2], a web service for integrating and analyzing
high-throughput omics data, namely microarray-based measurements of mRNA and microRNA
expression and methylation assays, together with the formal biological knowledge
mentioned above. The expression data sets can be uploaded by the user or fetched from the
public database NCBI GEO (National Center for Biotechnology Information - Gene
Expression Omnibus) [3]. The knowledge is internally represented as graphs of omics
interactions or sets of omics units, and may be uploaded by the user in a predefined
canonical format. Otherwise, the user can choose the default system-provided knowledge
sources, which originate from already curated sources, namely the gene ontologies (GO)
[4], KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways and other curated gene sets
from the Molecular Signatures Database (MSigDB) [5]. The tool is conceived as a workflow
management system, which enables building the most commonly used analytical pipelines.
Data processing is followed by integrative statistical or machine-learning-based
analysis. The workflow is completed with a presentation of the results in
expert-comprehensible terms and with visualization.

To make the tool more effective, namely for large-scale bioinformatics experiments,
we have migrated the most demanding segments of the computation to the infrastructure
of the Czech National Grid. The tool is freely available at mixgene.felk.cvut.cz/.

2 Related Work 

Workflow management systems (WMS), of which miXGENE is an instance, are a growing area
of research [6]. The main purposes of these systems are: (i) to make computational
biology accessible to researchers who are trained informaticians yet not programmers,
(ii) to enable tracking of the experimental history and offer a tool for testing
different settings, and (iii) to allow the exchange of scientific workflows. There are
many general tools designed to represent bioinformatic or data-analytic workflows, e.g.,
Taverna [7] or Galaxy [8]. miXGENE has been implemented as a specialized bioinformatics
WMS. To handle the most computationally demanding workflows, we migrate the critical
parts to the infrastructure of the Czech National Grid, CESNET - MetaCentrum [9].

There are several bioinformatics tools that reside in the CESNET structures and
integrate a number of tools into a computational pipeline [10], [11]. In contrast,
miXGENE runs on an independent server and migrates the most demanding computational
segments into the CESNET infrastructure as needed.

3 System description 

3.1 Basic Architecture 

The tool can be split into three parts: 1) GUI (task definition, presentation of results), 
2) workflow management (task decomposition and its global planning in terms of 
the individual plugins) and 3) computational plugins (implementation of the individual 
analytical methods such as data normalization, feature extraction, learning of classifi¬ 
ers, etc.). Web interface and storage management are implemented in the web applica¬ 
tion framework Django, the workflow management is implemented in JavaScript and 
the computational plugins are mainly implemented in Python. 

With miXGENE, all experiments are built from components called blocks in an
interactive workspace. miXGENE defines two types of blocks: proper-blocks and
meta-blocks. The former represent particular atomic tasks.

Each block represents one meaningful step in the experiment, such as: 1) providing
a data source, i.e., a user-uploaded or fetched dataset (a set of measured gene
expressions) and/or a source of formal knowledge (interaction graph, curated gene sets
or pathways); 2) preprocessing, analysis and creating the model itself; 3) presentation
of the model and its results. The execution order is inferred from the data flow defined
by binding the corresponding output and input ports of the consecutive blocks.
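
Inferring an execution order from port bindings amounts to a topological sort of the
block graph. The following Python sketch illustrates the idea with the standard-library
graphlib module; the block names and the bindings dictionary are hypothetical, and the
actual miXGENE scheduler (implemented in JavaScript according to Section 3.1) may work
differently.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each block maps to the set of blocks whose output
# ports are bound to its input ports. Block names are illustrative only.
bindings = {
    "fetch_dataset": set(),
    "load_gene_sets": set(),
    "normalize": {"fetch_dataset"},
    "train_classifier": {"normalize", "load_gene_sets"},
    "show_results": {"train_classifier"},
}


def execution_order(bindings):
    """Return one valid order in which every block runs after its inputs."""
    return list(TopologicalSorter(bindings).static_order())


order = execution_order(bindings)
```

Blocks with no dependency between them (here the two data sources) may appear in either
order, which is also what makes them candidates for parallel execution.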

The meta-blocks serve as containers of blocks or other meta-blocks. They generate
their own scope of possible input variables. For a single sequence of blocks, the
meta-blocks create alternative scenarios over multiple inputs, variables or
parameterizations. The main use of the meta-blocks is custom iteration over a
user-defined collection of: a) data sets or knowledge sources, b) subsets of a data set
or c) different analytical methods and their configurations. Varying these variables
serves, respectively, for: (i) validating a method or an analytical workflow over as many
data inputs as possible, (ii) validating the method on a particular dataset (i.e.,
cross-validation) and (iii) assessing the reliability of a knowledge source, such as
putative pathways or predicted omics interactions, which the user has previously acquired.

3.2 Migration of the computationally critical segments

Formerly, a single experiment was executed serially: it was packed as a whole and sent
to the miXGENE application server. In the new version, we decompose the workflow
into smaller, mutually independent tasks, pack them and send them to the grid.

In particular, the cross-validation meta-block consists of semantically independent
sub-workflows (folds) of a common pattern. In our implementation, we pack the
instances related to the sequence (sub-workflow) inside the cross-validation scope. The
packed instances are then asynchronously sent to the grid nodes, where they are
executed. The server receives the finished tasks and integrates them into a result
container. This approach fits the map-reduce paradigm.
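
The map-reduce pattern described above can be sketched in Python as follows. This is not
the miXGENE implementation: a local thread pool stands in for the grid nodes, a trivial
threshold classifier stands in for the packed sub-workflow, and all function names are
hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def make_folds(samples, k):
    """Split the samples into k disjoint test folds (round-robin)."""
    return [samples[i::k] for i in range(k)]


def run_fold(fold_id, folds):
    """Stand-in for one packed sub-workflow: train on k-1 folds, test on one."""
    test = folds[fold_id]
    train = [s for i, fold in enumerate(folds) if i != fold_id for s in fold]
    # Toy model (assumption): predict label 1 when the value exceeds the
    # mean of the training values.
    threshold = sum(value for value, _ in train) / len(train)
    hits = sum(int(value > threshold) == label for value, label in test)
    return fold_id, hits / len(test)


def cross_validate(samples, k=5, workers=4):
    folds = make_folds(samples, k)
    accuracies = {}
    # Map step: the pool plays the role of the grid nodes (assumption).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_fold, i, folds) for i in range(k)]
        # Reduce step: collect the finished folds in completion order.
        for future in as_completed(futures):
            fold_id, accuracy = future.result()
            accuracies[fold_id] = accuracy
    return accuracies, sum(accuracies.values()) / k


samples = [(float(v), int(v >= 5)) for v in range(10)]
per_fold, mean_accuracy = cross_validate(samples, k=5)
```

Because the folds share no state, the results can be integrated in whatever order they
arrive, which is what allows asynchronous dispatch.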

4 Conclusion 

We propose an innovation of our workflow management system miXGENE. The system
allows easy construction of scientific pipelines for bioinformatics. The innovation
lies in the effective migration of the most time-consuming segments of workflows to the
infrastructure of the Czech National Grid.

Acknowledgment: The miXGENE system has been developed with the support of grant
NT14539 of the Ministry of Health of the Czech Republic. The innovation of the system
is supported by the CESNET Development Fund. Access to computing and storage
facilities owned by parties and projects contributing to the National Grid Infrastructure
MetaCentrum, provided under the programme "Projects of Large Research, Development,
and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.

Abstract. In this paper we present a method for the automated generation of factual
questions from text that uses information about sentence structure and semantic
information about words. Based on sentence structure we create sentence patterns,
by means of which we transform declarative sentences into questions of various
types. Semantic information about words improves the quality of the generated
questions. A significant improvement over common approaches is storing the
patterns in a hierarchy, which makes it possible to manage the patterns more
efficiently and to generate questions at different levels of abstraction. Finally,
we propose using machine learning to create and derive new patterns that exploit
the syntactic structure of sentences and the semantic categories of words.

Contribution type: PhD Symposium

Keywords: automated question generation, sentence patterns, text processing


1 Introduction

The quality and quantity of educational materials available online is constantly growing,
and learning from these resources is becoming ever more accessible. Verifying the
acquired knowledge relies on test tasks, where the answers reveal to what extent the
student has understood the text or possesses the knowledge the text discusses. A tool
able to generate factual questions from a text to verify the student's knowledge would
automate the educational process considerably.

Question generation from educational text has been classified among the tasks of natural
language processing [6]. Compared with related tasks it ranks among the more demanding
ones, since it requires transforming text in both directions: from natural language into
a machine representation and back. First, the input text must be transformed into a
machine representation (the domain of Natural Language Understanding), and then the
generated questions are rendered in natural language (the domain of Natural Language
Generation) [4]. Both the input and the output are therefore natural-language text.

In this paper we build on our previous research [2] [3], in which we described the state
of the art in automatic question generation based on sentence analysis. We focus on
English, since this allows comparison with similar work. Our approach can be adapted to
other languages, but we expect the quality of the questions to be lower, because text
annotation tools are most reliable for English. Methods based on rules and patterns,
combined with machine learning, currently appear to be the most promising approaches. In
the following section we briefly summarize the findings from applying these approaches
and their main shortcomings, and then present our solution for improving the process.

2 Using sentence structure in question generation

Automated question generation has become a popular area. This was helped by the
possibilities of syntactic and semantic text analysis, which are accessible through
several tools from academia (e.g., the tool suite from Stanford University 1 or the
machine-readable dictionary WordNet 2). These tools provide a wealth of information about
text structure: they can, for example, determine sentence and word boundaries, assign
parts of speech, build the syntactic tree of a sentence, or identify named entities. The
information obtained about sentence structure and word semantics offers a promising basis
for text analysis, and its usefulness grows as the accuracy of the tools keeps improving.
Many current approaches build on methods based on rules and patterns (e.g., [1] [4] [5]).
The dissertation [4] uses a set of transformation rules that are applied step by step to
the sentences of the input text on the basis of sentence structure. The rules first
simplify complex sentences, and the simplified declarative sentences are then transformed
into interrogative sentences (questions). The authors of [1] also use sentence structure
for the transformation, but the transformation itself is carried out in a single step:
pruning the syntactic tree yields a sentence skeleton, to which a matching pattern is
applied. Both approaches exploit sentence structure, but they share a common problem:
poor extensibility, since rules and patterns must be created manually. The authors of [5]
likewise showed that extending the patterns makes it possible to generate better
questions, but the number of patterns grew considerably.

3 Question generation using sentence patterns

The sentence patterns in the works described contain parts of speech and named entities
at the same time. Given the great variety of sentences (the number of categories of the
individual tokens), combining them requires a large number of patterns, because each
pattern covers only a limited set of sentences. In our proposal we therefore use a
combination of several simple patterns, which we store in a hierarchy. A single sentence
pattern is a sequence of token tags at a particular level, for example a sequence of
parts of speech or a sequence of named entities. A more abstract pattern (higher in the
hierarchy) covers more sentences but provides more general information about the sentence
than more specific patterns do, so the questions it produces are also more general. At
the top level of the hierarchy are abstract patterns representing only the syntactic
structure of the sentence; these are further refined into the more specific patterns to
which they are linked (Fig. 1).


1 http://stanfordnlp.github.io/CoreNLP/

2 http://wordnetweb.princeton.edu/perl/webwn



Fig. 1. Pattern hierarchy.

For a sentence that matches a pattern at a higher level (e.g., The capital city of
Slovakia is Bratislava) we can apply a pattern that matches not only the parts of speech
(the first node of the hierarchy) but also further token properties (e.g., a particular
type of preposition or named entity). Questions are generated on the basis of this
combination: for a sentence with the same part-of-speech pattern but different entity
types (e.g., The current president of Slovakia is Andrej Kiska), the more specific
patterns let us determine the type of question to be created (the last token is tagged as
a person, not a location). This makes pattern lookup more efficient, proceeding from
general to specific patterns, and allows the pattern set to be extended at a lower
computational cost of matching: the first step selects the subset of patterns that
satisfy the basic criteria, and only these are then applied in the question generation
process. In the future the patterns can also be extended with further parameters (e.g.,
semantic categories of content words), improving the coverage of different sentences.
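
The two-step lookup (general part-of-speech pattern first, specific entity pattern second)
can be illustrated with a minimal Python sketch. Everything in it is an assumption made
for illustration: the coarse tag names, the hand-assigned tags of the two example
sentences, the treatment of "Andrej Kiska" as a single token, and the question templates.
A real system would obtain the tags from an annotation tool.

```python
# Hypothetical two-level hierarchy. Level 1 is a sequence of coarse part-of-speech
# tags; level 2 refines it by the named-entity tag of the last token. The tags and
# question templates are illustrative assumptions, not the paper's pattern set.
HIERARCHY = {
    ("DET", "ADJ", "NOUN", "ADP", "PROPN", "VERB", "PROPN"): {
        "LOCATION": "What is {subject}?",
        "PERSON": "Who is {subject}?",
    },
}


def generate_question(tokens, pos_tags, entity_tags, hierarchy=HIERARCHY):
    """Match the general POS pattern first, then the more specific entity pattern."""
    refinements = hierarchy.get(tuple(pos_tags))
    if refinements is None:
        return None  # no top-level pattern covers this sentence
    template = refinements.get(entity_tags[-1])
    if template is None:
        return None  # POS pattern matched, but no specific pattern applies
    verb_index = pos_tags.index("VERB")
    subject = " ".join(tokens[:verb_index])
    subject = subject[0].lower() + subject[1:]
    return template.format(subject=subject)


POS = ["DET", "ADJ", "NOUN", "ADP", "PROPN", "VERB", "PROPN"]  # hand-assigned tags

capital = generate_question(
    ["The", "capital", "city", "of", "Slovakia", "is", "Bratislava"],
    POS,
    ["O", "O", "O", "O", "LOCATION", "O", "LOCATION"],
)
president = generate_question(
    ["The", "current", "president", "of", "Slovakia", "is", "Andrej Kiska"],
    POS,
    ["O", "O", "O", "O", "LOCATION", "O", "PERSON"],
)
```

Both sentences share the top-level pattern, so a single dictionary lookup narrows the
candidates before the entity tag of the last token selects the question type.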

The second extension of pattern use is the incorporation of machine learning methods into
the search for and derivation of new patterns. We have at our disposal both information
about the positions of individual tokens, parts of speech and named entities, and a
database of transformation patterns (rules describing how declarative sentences are to be
turned into questions). The rule for creating a question is selected on the basis of a
match, or similarity, between patterns. Similarity takes into account not only the number
of identical token tags but also how readily tags can be interchanged: some token tags
(e.g., noun and pronoun) are more easily interchangeable than other pairs (e.g., noun and
verb). The similarity computation therefore currently includes tag identity, tag
similarity and the substitutability of tokens at the individual levels.
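
One possible way to combine tag identity with tag substitutability is sketched below in
Python. The substitution scores, the averaging scheme and the function names are
hypothetical; the text does not specify the exact formula used.

```python
# Hypothetical substitution scores in [0, 1] for unordered tag pairs;
# pairs that are not listed score 0.
SUBSTITUTION = {
    frozenset({"NOUN", "PRON"}): 0.8,  # easily interchangeable
    frozenset({"NOUN", "VERB"}): 0.1,  # hardly interchangeable
}


def tag_similarity(a, b):
    """1.0 for identical tags, otherwise the substitution score of the pair."""
    if a == b:
        return 1.0
    return SUBSTITUTION.get(frozenset({a, b}), 0.0)


def pattern_similarity(p, q):
    """Mean position-wise tag similarity; patterns of unequal length score 0."""
    if len(p) != len(q) or not p:
        return 0.0
    return sum(tag_similarity(a, b) for a, b in zip(p, q)) / len(p)


def best_pattern(sentence_tags, known_patterns):
    """Pick the known pattern most similar to the tags of the sentence."""
    return max(known_patterns, key=lambda pat: pattern_similarity(sentence_tags, pat))


known = [("NOUN", "VERB", "NOUN"), ("VERB", "VERB", "NOUN")]
chosen = best_pattern(("PRON", "VERB", "NOUN"), known)
```

Here a sentence starting with a pronoun is mapped to the noun-initial pattern because the
pronoun-noun pair is scored as nearly interchangeable.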

We also use token tags in learning, i.e., training. The initial set of transformation
patterns was learned from existing sentence-question pairs. A similar principle can be
used to improve the algorithm by means of reinforcement learning, where the probability of
applying a particular pattern is partly influenced by information about the pattern's
successful use in previous cases. If a generated question is not accepted, the pattern
used will be applied with a lower probability in the future.
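
The feedback idea can be illustrated with a simple multiplicative weight update in Python.
This is a deliberately minimal heuristic rather than a full reinforcement learning
algorithm, and the class name, the learning rate and the update rule are assumptions.

```python
import random


class PatternSelector:
    """Chooses patterns with probability proportional to their weights."""

    def __init__(self, patterns, learning_rate=0.5):
        self.weights = {pattern: 1.0 for pattern in patterns}
        self.learning_rate = learning_rate  # hypothetical step size in (0, 1)

    def probabilities(self):
        total = sum(self.weights.values())
        return {pattern: w / total for pattern, w in self.weights.items()}

    def choose(self, rng=random):
        patterns = list(self.weights)
        weights = [self.weights[p] for p in patterns]
        return rng.choices(patterns, weights=weights, k=1)[0]

    def feedback(self, pattern, accepted):
        """Raise the weight after an accepted question, lower it otherwise."""
        factor = 1 + self.learning_rate if accepted else 1 - self.learning_rate
        self.weights[pattern] *= factor


selector = PatternSelector(["A", "B"])
selector.feedback("A", accepted=False)  # a question built with pattern A was rejected
probabilities = selector.probabilities()
```

After one rejection, pattern A is still selectable but less likely than pattern B, which
matches the behaviour described above.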

4 Conclusion

In this paper we outlined the current state of automated question generation using rules
and patterns, together with two extensions by which we try to improve this approach. The
first is storing patterns in a hierarchy, including the relations between them, which
makes it possible to look patterns up more efficiently and thus to manage a larger number
of different pattern types. Even this does not fully resolve the variety of sentences,
which requires the number of patterns to grow if more sentence types are to be covered.
We therefore try to use machine learning methods that derive and create patterns on the
basis of similar token features, namely parts of speech and named entities. In the future
we also plan to take into account the syntactic relations between words contained in the
syntactic tree of the sentence, the semantic categories of content words, and word
categories based on the concept of linked data.

Acknowledgment: This paper was created thanks to support under the Operational Programme
Research and Development for the project International Centre of Excellence for Research
of Intelligent and Secure Information and Communication Technologies and Systems,
ITMS 26240120039, co-financed

Annotation:

Leveraging Sentence Structure for Transformation into Question

In this paper we propose an approach that enhances and extends the template-based method
for automatic question generation. Current approaches have shown that template-based
methods are promising for this problem, but they have some limitations. The main problems
lie in extending templates to various types of sentence structure and in matching a large
number of patterns against the sentences. We use multiple simple patterns stored in a
hierarchy, which makes pattern matching easier. Because the number of patterns needed to
cover various sentences grows rapidly and these patterns must be created manually, we
propose to use machine learning for the creation and derivation of new patterns. Learning
leverages sentence structure parameters and semantic information about words (e.g.,
part-of-speech tags and named entity categories; we also consider taking super-sense tags
and linked data concepts into account).

Abstract. Our work deals with research on behavioral biometric characteristics for
mobile devices as a more user-acceptable form of identification and authentication.
The problem is the insufficient accuracy of biometric systems for user identification
and authentication, as well as external factors (e.g., the user's body posture while
operating the device) arising from the mobile use of the devices, which may reduce
the accuracy of the systems further. The goal of the work is to create a user model
that can cope with external factors, i.e., maintain accuracy under various external
factors. So far we have found that the biometric characteristics of pressure and
gesture duration on touch screens differ for a single user across body postures.

Contribution type: PhD Symposium

Keywords: behavioral biometrics, user modelling, external factors


1 Introduction

Users of mobile devices, especially smartphones, often neglect their security, for
instance when locking the smartphone: for greater convenience they choose weaker
passwords or unlock patterns that can be read from the smudges left on the touch screen.
To increase security while preserving the user-friendliness of identification and
authentication, we can use behavioral biometric characteristics (behavioral biometrics,
hereinafter "biometrics"), i.e., behavior patterns unique to each user.

Biometric identification and authentication of the user is only one of the tasks that
biometrics make possible. In general we say that the system models the user on the basis
of biometrics, i.e., it builds a user model [5]. To carry out a chosen task by means of
biometrics (such as identification and authentication), the system first needs to record
data, e.g., the position of the finger on the touch screen or the pressure exerted on the
touch screen. The system then preprocesses the data, e.g., splits them into samples (such
as gestures). From the preprocessed data we extract biometrics (mean pressure on the touch
screen, gesture duration, etc.) that we expect to discriminate well for the given type of
task. The biometrics obtained during the initial recording serve as templates. Biometrics
obtained later are compared with the stored templates, and the system selects the
best-matching template. The comparison uses distance metrics, statistical methods or
machine learning methods (k-nearest neighbors, support vector machines, etc.).
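
Template matching with a distance metric can be sketched in Python as follows. The user
names, feature values and threshold are hypothetical, and Euclidean distance is only one
of the comparison methods mentioned above; in practice the features would also need to be
scaled to comparable ranges.

```python
import math

# Hypothetical templates: user -> (mean touch pressure, gesture duration in seconds).
TEMPLATES = {
    "alice": (0.42, 0.31),
    "bob": (0.65, 0.18),
}


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(sample, templates=TEMPLATES):
    """Identification: return the user whose template is closest to the sample."""
    return min(templates, key=lambda user: euclidean(sample, templates[user]))


def authenticate(sample, claimed_user, threshold, templates=TEMPLATES):
    """Authentication: accept if the sample is close enough to the claimed template."""
    return euclidean(sample, templates[claimed_user]) <= threshold


sample = (0.40, 0.30)
closest_user = identify(sample)
```

Identification searches all templates, whereas authentication compares the sample with a
single claimed template against a threshold.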

A biometric system cannot always select the correct template. As measures of system
performance we most often use the false acceptance rate (FAR), the false rejection rate
(FRR), or the rate at which FAR and FRR are equal (the equal error rate, EER), which is
reached by setting the threshold of the chosen template comparison method.
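
The relation between the threshold, FAR, FRR and EER can be made concrete with a short
Python sketch. It assumes a distance-based comparison in which distances at or below the
threshold are accepted; the distance values are made up for illustration, and the EER is
only approximated by scanning the observed distances as candidate thresholds.

```python
def far_frr(genuine, impostor, threshold):
    """FAR and FRR when distances at or below the threshold are accepted."""
    far = sum(d <= threshold for d in impostor) / len(impostor)
    frr = sum(d > threshold for d in genuine) / len(genuine)
    return far, frr


def approximate_eer(genuine, impostor):
    """Return (threshold, error rate) where |FAR - FRR| is smallest."""
    best = None
    for threshold in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, threshold)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, threshold, (far + frr) / 2)
    return best[1], best[2]


# Hypothetical distances between samples and templates.
genuine = [0.10, 0.15, 0.20, 0.35]   # same user as the template
impostor = [0.30, 0.40, 0.50, 0.60]  # different user
eer_threshold, eer = approximate_eer(genuine, impostor)
```

Lowering the threshold reduces FAR at the cost of a higher FRR, and vice versa; the EER
summarizes this trade-off in a single number.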

Scientific studies in smartphone biometrics have dealt mainly with the biometrics
themselves and their accuracy in systems for user authentication. Tab. 1 gives an
overview of biometrics categorized by activity.


Tab. 1. Overview of the studied biometrics

Biometric group (activity) | Frequently extracted biometrics | EER
Dynamics of pressing graphical objects on the virtual keyboard [1][7] | Key press duration, time between key presses, mean pressure on the screen | 3-12%
Gestures on the touch screen (swipe, zoom, etc.) [4] | Gesture duration, start and end point of the gesture, mean pressure, gesture direction | 4-14%
Gait [6] | Accelerometer statistics (mean, minimum, maximum), length of one gait cycle | 8-28%
Gestures involving movement of the smartphone (answering a call, lifting the smartphone) [2][3] | Accelerometer and gyroscope statistics, length of one gait cycle, similarity of the captured and template movement | 8-20%


Judging by the EER values reported in the studies, biometrics do not provide sufficient
accuracy for biometric identification and authentication. Owing to the way mobile devices
are used, accuracy is also affected by external factors such as the user's body posture
(sitting, standing, lying) or the environment (outdoors, indoors). The goal of the work is
to design, implement and verify a user model that can adapt to external factors, i.e.,
maintain sufficient accuracy under various external factors.

2 User model adapted to external factors

Fig. 1 shows the general process of user modelling for identification and
authentication. Our initial proposal for coping with external factors is to define
separate templates for the individual external factors, for each user separately. The
drawback of this solution is the very large number of templates for which samples have to
be collected. Another problem is that some templates may be similar to one another.


[Figure: diagram of the user modelling process; recoverable labels: Input devices,
Success, Failure.]

Fig. 1. Initial design of the user model for identification and authentication
that takes external factors into account

To resolve the problems of the proposed model we need to examine the individual
biometrics and determine whether or not they change under different external factors. We
assess the similarity of a single biometric across external factors by comparing pairs of
values from the individual templates of one user. Similarity is evaluated with a
statistical test (in our case a t-test), for which we set a similarity threshold and the
number of external factors under which the biometric is considered similar.
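
A t-test comparison of one biometric across two postures can be sketched in Python with
the standard library only. The text does not say which variant of the t-test was used, so
Welch's two-sample statistic is an assumption here, as are the pressure values and the
critical value that stands in for the similarity threshold.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples (needs non-zero variance)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def differs(a, b, critical_value=2.0):
    """Treat the biometric as different when |t| exceeds a chosen critical value.

    The default critical value is a hypothetical stand-in for the similarity
    threshold; a real analysis would derive it from the t distribution.
    """
    return abs(welch_t(a, b)) > critical_value


# Hypothetical mean touch pressures of one user in two body postures.
sitting = [0.40, 0.42, 0.41, 0.43, 0.39]
lying = [0.55, 0.57, 0.54, 0.58, 0.56]
pressure_differs = differs(sitting, lying)
```

A biometric for which the test reports no difference across postures could share one
template, which would reduce the number of templates per user.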

Following this approach we carried out a first experiment with 43 participants, in which
we examined which characteristics change across body postures when simple swipes are
performed on a smartphone touch screen. We extracted a total of 11 biometrics concerning
the pressure on the touch screen, the gesture trajectory, the end points of the gesture
and the gesture duration. The experimental results showed that the mean and maximum
pressure on the touch screen distinguished the postures for more than 50% of users, the
gesture duration for more than 22% of users, and the time instant of the largest
deviation of the gesture trajectory from the distance between the gesture's end points
for more than 18% of users.

3 Conclusion

Our work deals with research on behavioral biometric characteristics for mobile devices
for the purpose of user identification and authentication. The insufficient accuracy of
biometric systems can be mitigated by extracting further biometrics from activities that
have so far been little explored. A greater problem in the use of mobile devices, however,
is posed by external factors (e.g., body postures), which may also reduce the accuracy of
the systems.

To address the problem of external factors we proposed a user model that can cope with
them. So far we have found that the biometrics of pressure and gesture duration for
touch-screen gestures differ for a single user across body postures.

Acknowledgment: This publication was created thanks to the partial support of the project
Adapting access to information and knowledge artifacts based on interactions and [...]
of a human in the digital space, grant APVV-15-0508.

Annotation:

User Model for Identification 

Our work deals with the research of behavioral biometrics for mobile devices as a more user- 
convenient form of user identification and authentication. One of the problems in the research 
area is the relatively low accuracy of biometric systems for the purposes of identification and 
authentication, as well as external factors (such as body postures of a user) due to the mobility of 
the users. The goal of our work is to create a user model that can cope with the external factors 
and maintain a reasonable level of accuracy given the external factors. So far we have discovered 
that touch pressure and length of simple swipes on a smartphone touch screen vary in different 
body postures in each user individually. 

Abstract. The common way of developing OWL ontologies for the semantic web is
to create and work with the ontologies directly in the RDFS/OWL language. That
might make the task harder than it could be, since OWL allows encoding the
same real-world situation using different combinations of language constructs.
We propose splitting the ontology development into two steps. First, the relevant
entities and relationships are described in an ontological background model, and
then the encoding style is chosen for each entity in the second step. Finally, the
seed of an OWL ontology is generated automatically from the background model.
The PhD thesis focuses on the development of visualization and transformation
methods and their implementation as graphical tools that will allow testing the
proposal with users.


Contribution type: PhD Symposium 

Keywords: ontology engineering, OWL, ontological background models 


1 Introduction 

Ontologies used as data schemas to describe the data on the semantic web are its essential
component. The common way of ontology development is to work with the ontologies directly
in the Web Ontology Language 1 (OWL). That might make the task harder than it could be,
since OWL allows encoding the same real-world situation using different combinations of
language constructs, following different encoding styles, which might affect the
suitability of the resulting ontology for various use cases. The engineer has to deal with
two problems at the same time: defining what concepts exist in the modeled domain and
choosing the OWL encoding style for them. We propose splitting ontology development into
two steps. First, the relevant concepts are described in an ontological background model,
analogous to an entity-relationship diagram in database design; second, the encoding style
is chosen for each entity. Finally, the seed of an OWL ontology is generated automatically
from the background model.

A similar issue arises on the side of the ontology user. When users want to learn how to
use terms from the ontology to describe their data, their only options are to look directly
at the ontology in OWL or at the ontology documentation, which is often not provided and is
rarely visual. That might make the task difficult, since the source OWL representation of
an ontology shows what terms can be used, but not how, and in what combinations, they
should be used. The "how to use" visualization could be achieved by summarizing a dataset
where the ontology is already in use, which essentially leads to learning by example. The
PhD thesis focuses on the development of visualization and transformation methods and
their implementation that will allow testing the proposal with users. The idea is
illustrated by Figure 1. Simply put, the goal is to make the work of ontology engineers and
users easier by adding another interface layer between them and the ontology in its source
form.

1 https://www.w3.org/TR/owl2-primer/
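
The two-step idea for the ontology engineer can be illustrated with a minimal Python sketch. It takes one relationship from a toy background model and emits it as Turtle text in two different OWL encoding styles. The model structure, the function name `encode_relationship`, the two style names and the `ex:` namespace are illustrative assumptions; this is not the PURO language or the transformation used by the thesis tools.

```python
# Hypothetical toy background model: one binary relationship between two entity types.
relationship = {"name": "employs", "subject": "Company", "object": "Person"}


def encode_relationship(rel, style):
    """Return Turtle text (prefix declarations omitted) for one relationship.

    Both styles are illustrative assumptions:
    - "object_property": the relationship becomes an owl:ObjectProperty.
    - "reified_class": the relationship becomes an owl:Class with two linking
      properties, which is handy when the relation needs its own attributes.
    """
    name, subj, obj = rel["name"], rel["subject"], rel["object"]
    lines = [f"ex:{subj} a owl:Class .", f"ex:{obj} a owl:Class ."]
    if style == "object_property":
        lines.append(
            f"ex:{name} a owl:ObjectProperty ; "
            f"rdfs:domain ex:{subj} ; rdfs:range ex:{obj} ."
        )
    elif style == "reified_class":
        cls = name.capitalize() + "Relation"
        lines.append(f"ex:{cls} a owl:Class .")
        lines.append(
            f"ex:{name}Subject a owl:ObjectProperty ; "
            f"rdfs:domain ex:{cls} ; rdfs:range ex:{subj} ."
        )
        lines.append(
            f"ex:{name}Object a owl:ObjectProperty ; "
            f"rdfs:domain ex:{cls} ; rdfs:range ex:{obj} ."
        )
    else:
        raise ValueError(f"unknown encoding style: {style}")
    return "\n".join(lines)


# The same background-model element yields two different OWL seeds.
for style in ("object_property", "reified_class"):
    print(f"# style: {style}")
    print(encode_relationship(relationship, style))
```

The point of the sketch is only that the domain description (the dictionary) stays fixed while the encoding decision is a separate parameter.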


[Figure: (a) current common state in the semantic web; (b) proposed state, in which the
ontology engineer first creates a background model and the ontology user is provided with
an ontology usage visualization as an aid. Labels in the figure: ontology engineer;
ontological background model created graphically; OWL ontology shown in ontology editor;
automatic summarization; ontology usage visualization.]

Fig. 1. Thesis proposal (b) compared to current state in ontology engineering (a)


2 State of the art 

The problem of heterogeneity of ontologies is targeted by a whole research area of
ontology mapping [6]. It aims at enabling the use of a combination of different ontologies;
however, it is not concerned with encoding style heterogeneity. Meta-modeling approaches
might allow abstracting from the OWL encoding differences. PURO ontological background
models (OBM) [7] allow modeling a part of reality in a representation that relaxes some of
the constraints imposed on OWL by its description logic grounding and can be mapped to
different OWL encoding styles. A similar meta-modeling approach is offered by OntoClean
[3], which, however, focuses only on classes in a taxonomy and is intended for coherence
testing. OntoUML [1] is a version of UML for conceptual modeling where the modeling
primitives are grounded in concepts of a foundational ontology. That allows validation of
the models against syntactical errors and application of ontological design patterns.
OLED [2], a graphical editor for OntoUML, allows transforming it into OWL fragments. The
transformation is hard-coded and each OntoUML element has a single OWL counterpart;
encoding style heterogeneity is not considered. The PURO language seems to be the most
promising way of representing ontological background models thanks to being mentally close
to OWL, and was therefore chosen for the proposal implementation.

Several dataset summarization tools usable for studying ontology usage in a dataset exist.
The main problem is that all of them have remained at a very experimental stage of
development and are not publicly available. Maps of ontology usage [5] use the same
principles as are proposed in the thesis. ExpLOD [4] offers a more complex approach based
on bisimulation contraction. Its result is a node-link visualization similar to what we
propose but more accurate: it shows combinations of links that reportedly exist in the
dataset, while we show combinations of links that possibly exist. Our visualization, on
the other hand, might be more intuitive, as it shows types of instances directly as node
labels, while ExpLOD shows types as separate nodes, which might lead to clutter.

3 Achievements so far 

Visual authoring of ontological background models in the web-based tool PURO Modeler 2
and their transformation to OWL in OBOWLMorph 3 have been developed and preliminarily
evaluated with users. A tool for visualization of ontology usage as combinations of types
and properties in a graph, LODSight 4 , has been developed but has not been tested with
users yet. An important starting point for the research lies in ontology visualization.
Therefore, a comprehensive survey of ontology visualization tools, including an updated
classification of visualization methods, has been carried out.
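
Summarizing ontology usage as combinations of types and properties can be sketched in a few lines of Python. The example below counts (subject type, property, object type) combinations over a toy list of triples. It is a naive illustration under assumed data and names (`summarize_usage`, the `ex:` resources), not the algorithm implemented in LODSight.

```python
from collections import Counter

RDF_TYPE = "rdf:type"

# Hypothetical toy dataset of (subject, predicate, object) triples.
triples = [
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:bob", RDF_TYPE, "foaf:Person"),
    ("ex:acme", RDF_TYPE, "org:Organization"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "org:memberOf", "ex:acme"),
    ("ex:bob", "org:memberOf", "ex:acme"),
]


def summarize_usage(triples):
    """Count (subject type, property, object type) combinations in the data."""
    # First pass: collect the declared types of every resource.
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)
    # Second pass: lift every non-type triple to the type level.
    summary = Counter()
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        # Untyped resources and literals are skipped in this sketch.
        for s_type in types.get(s, ()):
            for o_type in types.get(o, ()):
                summary[(s_type, p, o_type)] += 1
    return summary


for (s_type, prop, o_type), count in sorted(summarize_usage(triples).items()):
    print(f"{s_type} --{prop}--> {o_type}  ({count}x)")
```

Each resulting combination corresponds to one edge of a node-link summary in which node labels are types.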

4 Evaluation 

We have evaluated the PURO Modeler and OBOWLMorph concept with a group of ten students
with basic knowledge of OWL from a course on ontology engineering. The aim was to compare
ontology engineering in our tools with the common ontology editor Protégé 5 . The
hypothesis was that PURO-started development allows beginner users to create an ontology
appropriately covering the domain with less effort than common ontology development in
Protégé. The students were assigned to create a PURO model and an ontology in Protégé
according to a textual description of the model and were given a questionnaire afterwards.
A very brief overview of the evaluation results follows: modeling in PURO is a little
slower and more error-prone (partially due to the less strict UI), but leads to better
coverage of the domain. Both the time consumption and the number of errors are
nevertheless quite similar in OWL and PURO. According to the questionnaire, students
prefer PURO Modeler over Protégé. They consider PURO rather easy to learn and did not
hesitate much about which PURO construct to use for each entity. Given that PURO Modeler
and OBOWLMorph are at an early stage of development and their UI is not very user-friendly
yet, the results are quite encouraging. LODSight has so far been tested only from the
technical point of view. Evaluation with users is planned as future work.

2 http://protegeserver.cz/puromodeler/

3 http://protegeserver.cz/puromodeler/OBOWLMorph/

4 http://lod2-dev.vse.cz/lodsight-v2/

5 http://protege.stanford.edu

5 Conclusion 

We have proposed a method for starting ontology development from ontological background
models, exploiting the existing PURO language. The method has been implemented and
evaluated with users. The evaluation suggests that the tools are usable but need much
improvement. A tool for visualization of existing ontology usage, useful both for ontology
developers considering reuse of existing ontologies and for ontology users, has been
developed but not yet tested with users. The tool builds on existing methods, adds new
features and improves ease of use compared to similar existing tools. Future research will
aim at evaluating the whole framework with users, at improvements based on that
evaluation, and at integrating a semi-automated way of reusing concepts from existing
ontologies into the PURO-started ontology development process.

Acknowledgment: This research is supported by UEP IGA F4/28/2016. 

Abstract. User studies in the web domain are based on certain metrics, but the
question is how to apply these metrics to a larger group of users (study
participants). Considering that every participant has different qualities,
experience and skills, we also expect the results of studies in the same
environment to differ. Our research aims to show that quantitative studies can
yield more accurate results if we also take the personal characteristics of the
individual participants into account.

Contribution type: PhD Symposium

Keywords: quantitative studies, individual differences, user experience


1 Introduction and motivation

User studies help to focus on a specific problem, for example in design proposals or in
usability verification. Sometimes it is appropriate to use a larger sample of
participants. Considering that every user has different abilities, skills and experience,
we also expect the testing results to differ. The results may be affected by several
influences, some of which have already been identified.

We divide user studies into qualitative and quantitative. While qualitative studies
usually consist of a participant interacting in a given environment in the presence of a
moderator, who acts as an important mediator, quantitative studies are usually conducted
without one and thus without a deeper analysis of the individual user. In quantitative
studies we move from specific details to generalized information about the whole group of
participants.

The evaluation of user testing can be more accurate with additional information about the
participants' skills, such as Web or Computer literacy. The concept of Web literacy has
changed over the years and is currently characterized by the aspects of reading, writing
and participating, sometimes referred to as the areas of exploring, building and
connecting 1 . Together they then comprise [...]. In our work we try to uncover the basic
relationships between Web literacy and work in the web environment. It is precisely the
differences in how one of the groups (participants with high or low Web literacy) uses the
Web that can lead us to a better understanding of the basic principles.

2 Related work

Our attention is focused on studies that evaluate user interaction with regard to
individual differences among participants in quantitative studies. Despite long-term
research in usability, this topic has received only marginal attention.

Individuality, understood as uniqueness with respect to others, is described by many
psychological and medical studies [5]. It turns out that individuality has a great
influence on study results. The basic influences are psychological traits (neuroticism,
extraversion, openness, agreeableness, conscientiousness) [7], mostly modeled using
questionnaires. Over many years of research, tools have emerged that can identify
psychological traits more simply, for example from the form of written text. Research on
the influence of age and gender is gradually developing, sometimes enriched by examining
experience in the given area or education [6]. Testing the influence of gender does not
show a clear effect. Sometimes there are no differences in the perception of service
quality or information quality [7], but differences appear in the perception of content
quality [3]. The results apparently depend on further factors. When pupils were tested,
differences between genders appeared, for example, in the perception of graphical and
textual information, which are attributed to the better language abilities of girls.
Differences also appear in search patterns [1]. Another study [2] used users'
self-assessment of Internet literacy in relation to the length of social media use. From
other studies we know that subjective assessment is not very accurate, so we deal with
non-subjective evaluation. Testing of Web or Digital literacy is nowadays offered mainly
by companies that also provide education in these areas. The form of testing varies, from
applications running on desktops to online questionnaires.

3 Open problems and research goals

The current problem is that user studies do not take into account the individual
characteristics of users, and therefore not their Web literacy either. We are interested
in researching the influence of the individual characteristics of study participants on
these studies. At present we are interested in the relationship between Web literacy and
the extent to which it can affect experiment results. The goal is to propose a method for
detecting individual differences.


1 https://wiki.mozilla.org/Learning/WebLiteraciesWhitePaper 

4 Method for determining Web literacy

We propose our own specific solution consisting of a series of tests for detecting
characteristics, specifically Web literacy as a part of Digital literacy. The proposal is
based on the ability to correctly determine decisive tests for further automation. The
basic motive is a comparison of participants with higher literacy and participants with
lower literacy. The prerequisite is obtaining information from the participant's
interaction on the Web. A user's Web literacy is determined using a test consisting of
three parts.

The first part examines literacy explicitly using a questionnaire. Participants do not
rate themselves but answer a quiz composed of relevant questions. We implemented this part
using Google Forms questionnaires, with 14 questions and 4 available answers.

The next part tests knowledge of web icons commonly used on websites. The participant
demonstrates how well they know them by choosing suitable icons (e.g. "menu" or "send
e-mail") based on a question. In this way we tested 15 characteristic web icons.

In the third part we look for basic patterns in searching for areas on a web page, again
depending on the participants' literacy. Participants are asked to mark the place on the
screen where they expect the queried element to occur. We plan to compare the dependencies
between the marked position of the element and the determined literacy. Instead of real
pages we show them a visually modified schema, as in Figure 1. These schemas contain
information about the type of page ("electronics brand", "faculty website", "movie
database") and an instruction on which element to look for (e.g. "shopping cart" or
"search").
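
One possible way to compare marked positions between literacy groups is to map each mark to a cell of the grid laid over the schema and count marks per cell. The following Python sketch shows this under assumed values: the schema size, the 4x4 grid, the function names and the sample marks are all hypothetical and are not taken from the experiment.

```python
from collections import Counter


def grid_cell(x, y, width, height, cols, rows):
    """Map a marked screen position (pixels) to a (column, row) grid cell."""
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("position lies outside the schema")
    return (int(x * cols / width), int(y * rows / height))


def cell_counts(marks, width, height, cols, rows):
    """Count how many marks fall into each grid cell."""
    return Counter(grid_cell(x, y, width, height, cols, rows) for x, y in marks)


# Hypothetical marks for the "search" element on a 1200x800 schema with a 4x4 grid.
high_literacy = [(1100, 40), (1050, 60), (1150, 30)]
low_literacy = [(600, 400), (1100, 50), (100, 700)]

print(cell_counts(high_literacy, 1200, 800, 4, 4))
print(cell_counts(low_literacy, 1200, 800, 4, 4))
```

Comparing the two count tables per task would then show whether one group concentrates its marks in fewer cells than the other.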

We conducted a pilot experiment in the UX Lab at FIIT STU using a Tobii TX300 eye
tracker. The infrastructure [4] enables collecting information from the webcam, the
screen, the eye tracker and the web browser used.

User testing was conducted remotely, with a fixed order of tests and a random order of
tasks within them.



Fig. 1 Example of a schema with a grid and positions marked by participants.

5 Conclusion

We consider Web literacy a key aspect of today's human-computer interaction. So far only a
limited number of approaches to determining Web literacy exist, so we propose our own
method. The current experiment setup has shown that more suitable and more detailed
tracking of participant interaction needs to be introduced in order to obtain relevant
results. In further work we will therefore focus on automated evaluation of the level of
Web literacy based on more detailed information about participants' interaction in the
given environment.

Acknowledgment: This publication was created thanks to the partial support of projects
VEGA 1/0646/15 and HIBER APVV-15-0508.

Annotation:

Impact of Characteristics of Individuals on Evaluating the Quantitative Studies

Usability studies in the web domain are based on various metrics, but the question is how
to apply these metrics to evaluate a larger group of people. Considering that every user
has different qualities, skills and experience, we can expect that the results of testing
the same scenarios will differ. Our research aims to show that quantitative studies can
provide more accurate results if we work with information about the personal
characteristics of participants. We have already conducted a preliminary controlled
experiment on a small sample of participants, which explores the influence of Web
literacy.

Abstract. The development and deployment of smart meters lead to the collection
of new types of data. These data provide information about the current
consumption and the load profiles of individual delivery points. The acquired
data open up possibilities for creating new models aimed at making the
prediction of electricity consumption more accurate. Given the limited
possibilities of producing and storing electricity, this problem remains highly
topical. The paper presents a method that combines the prediction results of
several prediction models. In the literature this approach is called Ensemble
Learning. An important part of the method is the way the partial results are
combined into the final prediction. We solve this complex numerical problem
using biologically inspired algorithms, which are able to provide an optimal
solution in finite time and with relatively low computational demands.

Contribution type: PhD Symposium

Keywords: time series, prediction, ensemble learning, biologically inspired
algorithms


6 Introduction

There currently exist several dozen different prediction methods intended for working
with time series. In general, these methods can be divided into three main groups [7]:

• traditional methods (regression, multiple regression, exponential smoothing)

• modified traditional methods (adaptive methods, stochastic methods, the autoregressive
ARMA and ARIMA models, support vector regression)

• artificial intelligence methods (evolutionary algorithms, fuzzy logic, neural networks,
knowledge-based expert systems and others)

Each of these methods has its advantages and disadvantages. Given the complexity of time
series prediction, and of electricity consumption prediction in particular, it is
difficult to choose one particular method that can always provide a correct result. The
solution is an approach based on combining several prediction models.

7 Ensemble learning

Ensemble learning is one of the approaches in machine learning; it can be defined as a
process consisting of training and combining diverse models whose task is to solve a given
problem [6]. A similar approach can be observed in human behavior. An example is a
parliament or a senate, where the opinions of several experts are taken into account when
an important decision is made.

Ensemble learning can be used to improve the results of clustering, classification and
prediction models [8, 9]. The process depends strongly on three main components. The first
component ensures the generation of diverse prediction models. The second component
decides which of the generated models are retained and which are removed from the base
ensemble because of insufficient results. The last component is responsible for
integrating the individual models in order to make the final prediction more accurate.

7.1 Generating the set of prediction models

Generating the set of prediction models means training the individual models on a set of
training data. A heterogeneous approach is used when training the models (the individual
models are trained on the same datasets). The prediction models used (Tab. 1) differ in
the way they compute and in their requirements on the size of the training window.

Some prediction models can be extended with external factors, such as weather forecasts,
which can make the resulting prediction more accurate.

Tab. 1 List of prediction models used in the ensemble. The models can be divided into
three groups: TM - traditional methods (regression models), MTM - modified traditional
methods (models based on time series analysis) and AI - models based on artificial
intelligence.

No.  Model name                                                        Model type  External factors included
1    Multiple linear regression                                        TM          yes
2    Feedforward neural network                                        AI          yes
3    Recurrent neural network                                          AI          yes
4    Deep neural network                                               AI          yes
5    Support vector regression                                         AI          yes
6    Random forests                                                    AI          yes
7    Moving average                                                    MTM         no
8    ARIMA model                                                       MTM         no
9    Time series decomposition and prediction of individual components MTM         no

7.2 Pruning the set of models

Pruning (reduction) of the set of models serves to make the resulting prediction more
accurate and to reduce computational and memory demands. In this step, the models with
the worst results are excluded from the set. Two approaches are used most often:
partitioning-based and search-based [6].
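
The pruning step can be illustrated with a deliberately simple Python sketch that keeps the models with the lowest validation error. This plain ranking is an assumption chosen for brevity; it is neither the partitioning-based nor the search-based approach named above, and the model names and error values are hypothetical.

```python
def prune_models(errors, keep):
    """Return the names of the `keep` models with the lowest validation error."""
    if not 0 < keep <= len(errors):
        raise ValueError("keep must be between 1 and the number of models")
    ranked = sorted(errors, key=errors.get)  # ascending error
    return ranked[:keep]


# Hypothetical validation errors (MAPE in percent) of the models in the set.
validation_mape = {
    "regression": 6.2,
    "arima": 4.8,
    "random_forest": 3.9,
    "moving_average": 9.5,
}
print(prune_models(validation_mape, keep=3))
```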


7.3 Integration

The last step in the ensemble learning process is combining the results of the individual
models in the set into the final result. Unlike classification methods, where the final
result is determined by the most frequent answer, the integration of prediction model
results is more complicated in regression problems. One of the methods used for this
purpose is the weighted average (equation 1):


$$F_{final} = \frac{\sum_{j=1}^{m} w_j F_j}{\sum_{j=1}^{m} w_j} \tag{1}$$


where F_final is the resulting prediction, F_j is the prediction of the j-th model, m is
the number of prediction models entering the integration phase and w_j is the weight of
the j-th model in the weighted average.
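
The weighted average of equation 1 translates directly into code. The following minimal Python sketch uses hypothetical predictions and weights; the function name `weighted_average` is our own.

```python
def weighted_average(predictions, weights):
    """Combine the predictions F_j of m models using weights w_j (equation 1)."""
    if len(predictions) != len(weights):
        raise ValueError("one weight per model is required")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * f for w, f in zip(weights, predictions)) / total_weight


# Hypothetical load predictions (kW) of three models for one time step.
predictions = [100.0, 110.0, 130.0]
weights = [0.5, 0.3, 0.2]
print(weighted_average(predictions, weights))
```

With equal weights the function reduces to the ordinary arithmetic mean.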

Compared with the ordinary average, the weighted average makes it possible to use the
weights to favor or penalize models based on their results. New weight values are computed
from the prediction errors of the individual models. The prediction error is computed as
the mean absolute percentage error (MAPE) [2].
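
A short Python sketch of the error measure follows. The `mape` function implements the standard definition of MAPE; the rule that turns errors into weights (`inverse_error_weights`, weight proportional to 1/MAPE) is only an assumed placeholder, because the text does not state the exact rule, and the sample series are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error in percent; actual values must be non-zero."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def inverse_error_weights(errors, eps=1e-9):
    """Assumed rule: weight proportional to 1 / MAPE, normalized to sum to 1."""
    raw = [1.0 / (e + eps) for e in errors]  # eps avoids division by zero
    total = sum(raw)
    return [r / total for r in raw]


actual = [100.0, 200.0, 400.0]
model_a = [110.0, 190.0, 400.0]  # hypothetical forecasts of model A
model_b = [150.0, 100.0, 200.0]  # hypothetical forecasts of model B

errors = [mape(actual, model_a), mape(actual, model_b)]
print(errors, inverse_error_weights(errors))
```

Model A, with the lower error, receives the larger weight, which is the favoring and penalizing behavior described above.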

The computation of the weights can be characterized as a constrained optimization
problem. Based on our previous research [1], we decided to use the optimization algorithms
Artificial Bee Colony [3] and Particle Swarm Optimization [4], which achieved good results
and had low time demands compared with the other tested algorithms.
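
To show what such a weight search can look like, here is a compact, textbook-style Particle Swarm Optimization sketch in pure Python. It searches for weights in the box [0, 1]^m that minimize the MAPE of the weighted-average ensemble. The swarm size, the constants `inertia`, `c1`, `c2`, the box constraint and all data are assumptions made for illustration; this is not the configuration used in the experiments.

```python
import random


def mape(actual, predicted):
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def ensemble_forecast(weights, model_forecasts):
    """Weighted average of the model forecasts at every time step."""
    total = sum(weights)
    return [
        sum(w * f[t] for w, f in zip(weights, model_forecasts)) / total
        for t in range(len(model_forecasts[0]))
    ]


def pso_weights(actual, model_forecasts, particles=20, iterations=100, seed=0):
    """Search for weights in [0, 1]^m that minimize the ensemble MAPE."""
    rng = random.Random(seed)
    m = len(model_forecasts)

    def cost(w):
        if sum(w) <= 0.0:  # constraint: the weights must not all be zero
            return float("inf")
        return mape(actual, ensemble_forecast(w, model_forecasts))

    # The first particle is the plain average, so the result is never worse than it.
    positions = [[1.0] * m] + [
        [rng.random() for _ in range(m)] for _ in range(particles - 1)
    ]
    velocities = [[0.0] * m for _ in range(particles)]
    best_pos = [p[:] for p in positions]
    best_cost = [cost(p) for p in positions]
    g = min(range(particles), key=lambda i: best_cost[i])
    g_pos, g_cost = best_pos[g][:], best_cost[g]

    inertia, c1, c2 = 0.7, 1.5, 1.5  # assumed, commonly used PSO constants
    for _ in range(iterations):
        for i in range(particles):
            for d in range(m):
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + c1 * rng.random() * (best_pos[i][d] - positions[i][d])
                    + c2 * rng.random() * (g_pos[d] - positions[i][d])
                )
                # Clamp to the box constraint 0 <= w_j <= 1.
                positions[i][d] = min(1.0, max(0.0, positions[i][d] + velocities[i][d]))
            c = cost(positions[i])
            if c < best_cost[i]:
                best_pos[i], best_cost[i] = positions[i][:], c
                if c < g_cost:
                    g_pos, g_cost = positions[i][:], c
    return g_pos, g_cost


actual = [100.0, 120.0, 90.0, 110.0]
forecasts = [
    [98.0, 123.0, 91.0, 108.0],    # hypothetical accurate model
    [120.0, 140.0, 115.0, 135.0],  # hypothetical model that overestimates
    [80.0, 100.0, 70.0, 95.0],     # hypothetical model that underestimates
]
weights, error = pso_weights(actual, forecasts)
print(weights, error)
```

Seeding the swarm with the equal-weight vector is a design choice of this sketch: it guarantees that the returned ensemble is at least as accurate on the given data as the ordinary average.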


8 Conclusion

Our goal is to create a method based on ensemble learning that will be suitable for
working with time series and will produce accurate predictions. We test the created method
as well as our assumptions on electricity consumption data. Our intention is to produce
more accurate short-term predictions of electricity consumption using ensemble learning
and biologically inspired algorithms. Experiments carried out on real data from smart
meters confirm the success of the proposed method [5]. In the experiments [1] we also
examined the ability of various biologically inspired algorithms to increase the
predictive ability of the ensemble in situations where unpredictable changes (sudden as
well as gradual) appear in the data. The best results were achieved by swarm intelligence
algorithms.

Acknowledgment: This publication was created thanks to the partial support of project
VEGA 1/0752/14.

Annotation:

Method of combining prediction models for more precise prediction of power load
consumption

Ensemble learning is one of the machine learning approaches that can be defined as the
process of training and combining diverse models to solve a particular computational
problem [6]. Ensemble learning can be used for improving the performance of clustering,
classification or prediction [8, 9]. The whole process depends on three main parts: first,
the way the set of diverse models is created; second, which models are eliminated from the
set depending on their performance; and third, the way the models are integrated into the
final prediction. In our research we investigate different ways of constructing an
accurate ensemble. We focus on different weighting schemes for the predictive base models,
especially from the field of biologically inspired algorithms, such as Artificial Bee
Colony [3] or Particle Swarm Optimization [4]. Our goal is to create an ensemble learning
method suitable for precise short-term load predictions.

Abstract. Data mining is very important for modern logistics management, as it
helps to improve decisions, increase sales, reduce costs and so on. In the
context of these decisions, a key role is played not only by correct information
and knowledge, but also by the way they are used effectively. This article deals
with data analysis for improving decision-making in a selected logistics
process: the selection of drivers for planned delivery routes. The article
presents our initial steps in solving the individual subtasks using decision
trees and simple statistical methods, which we applied to data from a particular
company; the results achieved are presented in this article.

Contribution type: PhD Symposium

Keywords: data analysis, logistics, RapidMiner


1 Introduction

Data analysis and logistics fit together perfectly. Logistics companies often manage a
large flow of goods and create a large amount of data in the process. These data hide
potential for new business models. Inventory management, shipment tracking and even the
placement of sensors in vehicles all provide large amounts of data [1] [3]. It is
difficult for logistics companies to make timely and accurate decisions on process
management and logistics operations [2]. Data mining technologies and statistical analyses
can help to understand customer behavior and to carry out a corresponding strategy, which
allows the company to reduce the risk arising from a wrong decision [8].


1.1 Current state of the art

Recently, various research studies have pointed out the advantages of using big data
methods in logistics and supply chain management. In the area of logistics that we follow,
interesting results were published within the FP7 project Companion. The technical report
[5] describes, among other things, the results of research focused on building predictive
models of fuel consumption for trucks. To predict fuel consumption, the authors use
machine learning methods based on various factors such as the characteristics of the route
and the vehicle, the load weight, the driver's behavior while driving, and also the
weather. The resulting model should thus be able to determine the expected costs on
different roads during route planning and optimization. The data came from several
different sources, including a fleet management database, a vehicle configuration
database, a road database and historical weather data. The authors built predictive models
using linear regression, random trees, SVM and neural networks. The best accuracy was
achieved by random trees (an average error of 21.6% for prediction in one-minute intervals
and 13.1% for 10-minute intervals). The results showed, for example, that the larger the
sampling period (the longer the forecast horizon), the smaller the prediction error. The
most important decision attributes include vehicle weight, vehicle speed, road slope, wind
direction and wind speed.

We want to move this research towards decision support for selecting a driver for a
particular route. At present, drivers in the selected company are assigned manually,
according to where the driver currently is and whether the driver is entitled to time off.
Depending on the work available, the driver is dispatched on a trip, the so-called
"turnus", which lasts approximately 20 - 25 days [7]. However, other important
characteristics that are currently not taken into account certainly enter into the
decision on driver assignment.

Thanks to the large amount of available data on individual drivers, such as driver
performance, average fuel consumption, compliance with the maximum speed and many other
parameters that can provide information about, for example, driving style or average
speed, we can obtain an overall picture that makes it possible to compare, analyze and
build models and to carry out experiments for different drivers and different vehicles
[4].

2 Data description

We had at our disposal data obtained through the Dynafleet Online system of Volvo Truck
Corporation. The company uses this system to communicate with its vehicles. The vehicles
generate information that is sent to the system over the mobile phone network. The
information from the vehicles is stored in a database and subsequently in reports. The
data can be exported from the reports to an MS Excel file for further analysis. We had two
data sets from three different vehicles. The first data set was obtained from the fuel
consumption evaluation report and represented the driving style of the given vehicle on a
certain date. The data file also contained further numeric attributes such as: date, total
time, total distance, overall score, average speed, average fuel consumption (l/100 km),
total CO2 emissions, anticipation, coasting, engine utilization, load, speeding, speed
adaptation, cruise control, idling... The second data file was obtained from the tracking
report and contained attributes such as: date, driver name, fuel level, distance traveled,
location...

Data preprocessing consisted of generalizing the location to the particular country in
which the vehicle was located. We then reduced the attributes and selected only the date,
the driver name and the location from the second data file. Using the date attribute, we
merged both data files of one vehicle. We performed the same join for the other two
vehicles. We thus obtained three data samples, which we merged into one. After this merge
we removed duplicate values and strongly correlated attributes (correlation higher than
0.9) and discretized the individual numeric attributes (overall score, anticipation,
engine load...) as follows:

• 0-59: Potential for improvement

• 60-79: Average

• 80-100: Good performance

After this adjustment we could start designing the experiments and modeling.
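
Two of the preprocessing steps described above, removing strongly correlated attributes and discretizing scores, can be sketched in pure Python as follows. The column names and values are hypothetical, the greedy "keep the first of a correlated pair" rule is an assumption, and the sketch says nothing about how the steps were carried out in the tool the authors used.

```python
import math


def discretize_score(value):
    """Map a 0-100 score to one of the three ranges used in the text."""
    if not 0 <= value <= 100:
        raise ValueError("score must lie between 0 and 100")
    if value < 60:
        return "0-59"
    if value < 80:
        return "60-79"
    return "80-100"


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Keep a column only if |r| <= threshold against every column kept so far."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical attribute columns (one value per trip record).
columns = {
    "total_distance": [100.0, 200.0, 300.0, 400.0],
    "total_time": [1.1, 2.0, 3.2, 3.9],  # nearly proportional to the distance
    "overall_score": [75.0, 58.0, 91.0, 66.0],
}
kept = drop_correlated(columns)
print(list(kept))
print([discretize_score(v) for v in kept["overall_score"]])
```

In this toy data the time column is dropped because it is almost perfectly correlated with the distance column.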

2.1 Experiment design and modeling

The goal of the modeling was to find out which countries the individual drivers drive to
most often and what driving-style score they achieve in the given country. The overall
score represents a way of assessing a driver's driving style at a certain moment.
Attributes such as Anticipation, Speed adaptation, Engine and gear utilization, and
Standstill are also taken into account in the overall score. The system takes these values
into account and, on their basis, expresses a percentage score of the driver at the given
moment.

A simple statistical analysis showed that Germany and Norway were the countries most
frequently visited by all ten drivers. The graph shows that the drivers achieved a better
score in Germany than in Norway. One reason is that in Germany the drivers drive mainly on
motorways, whereas in Norway they have to overcome differences in altitude and drive
mainly over mountain passes, which has a significant effect on vehicle performance.

The graph also shows how the individual drivers cope with driving in a given country. We
can say, for example, that driver Vodic_F has a significantly better score than driver
Vodic_H. Such an analysis can help the company owner decide which driver to assign to
which route.
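
The simple statistical analysis behind such a graph amounts to averaging the overall score per driver and country. A minimal Python sketch follows; the records are invented for illustration and do not reproduce the company data.

```python
from collections import defaultdict


def mean_score_by_driver_and_country(records):
    """Average the overall score for every (driver, country) pair."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [sum of scores, record count]
    for driver, country, score in records:
        entry = sums[(driver, country)]
        entry[0] += score
        entry[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Hypothetical records: (driver, country, overall score in percent).
records = [
    ("Vodic_F", "Germany", 88.0),
    ("Vodic_F", "Norway", 76.0),
    ("Vodic_H", "Germany", 70.0),
    ("Vodic_H", "Germany", 66.0),
    ("Vodic_H", "Norway", 59.0),
]
for (driver, country), score in sorted(mean_score_by_driver_and_country(records).items()):
    print(f"{driver:8s} {country:8s} {score:5.1f}")
```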

Fig. 1 Graph of the overall score of the individual drivers.

The next experiment was rather descriptive in nature. We wanted to find the combination of
factors that have a key influence on driver performance, for which we used decision trees.
We used 20-fold cross-validation and information gain as the criterion for attribute
selection. The target variable was the overall score of the vehicle (Fig. 2), where blue
means that the driver achieved good performance in the overall driving-style score (values
before discretization 80 - 100) and green means that the driver achieved average
performance in the overall driving-style score (values before discretization 60 - 79).
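
The attribute selection criterion mentioned above, information gain, can be computed as the entropy of the target minus the weighted entropy after splitting on an attribute. The Python sketch below uses the standard definition; the rows and attribute names are hypothetical and the code is not the decision tree learner used in the experiment.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical discretized records; the attribute names are illustrative.
rows = [
    {"anticipation": "high", "cruise_control": "on", "score": "good"},
    {"anticipation": "high", "cruise_control": "off", "score": "good"},
    {"anticipation": "low", "cruise_control": "on", "score": "average"},
    {"anticipation": "low", "cruise_control": "off", "score": "average"},
]
for attribute in ("anticipation", "cruise_control"):
    print(attribute, information_gain(rows, attribute, "score"))
```

A tree learner using this criterion would split first on the attribute with the highest gain, here the one that separates the two score classes completely.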

2.2 Evaluation of the decision tree

The accuracy of the model we created is 83.52 %. The goal of the decision tree was to
determine the combination of factors that have a key influence on driver performance. From
the resulting decision tree we can therefore say that the essential attributes are
Anticipation, Average fuel consumption (l/100 km), Average speed and Total distance (km),
as well as Top gear, Uneconomical driving and Braking/stop ratio. These attributes
influence the way the driver's driving style is scored. An analysis of such data can
reveal areas in which fuel consumption can be reduced and can give drivers tips on what
they can improve.

Fig. 2 Decision tree (excerpt).


3 Conclusion and future work

In this work we focused on analysing the existing process of assigning drivers to routes in a selected logistics company. Using simple statistical analysis, we showed how individual drivers perform in each of the two most frequently visited countries. Such an analysis can help the owner decide on the correct assignment of a driver to a planned delivery route with regard to the driver's performance and driving style.

Further research will focus on creating and verifying a driver model and a delivery-route model. Based on data analysis and on interviews with drivers and other relevant actors in the process, we will build a model covering all decisive factors that influence driver performance and the characteristics of the delivery route, respectively. We will then create and experimentally verify a decision support system for assigning drivers to delivery routes in the selected logistics company. By measuring the organisation's performance before and after the deployment of the decision support system, we will determine to what extent the system benefits the selected logistics company.

Acknowledgement. This publication was supported by the Scientific Grant Agency of MŠVVaŠ SR and SAV, project no. 1/0493/16.

Annotation:

Data analysis to improve specific business processes of a logistics company.

Data mining is very important for modern logistics management: it helps managers make the right decisions, increase sales, reduce costs, and so on. For these decisions, not only correct information and knowledge are essential, but also a way to use them effectively. The article deals with data analysis to improve decision-making in a selected logistics process: the selection of drivers for planned delivery routes. The paper presents our initial steps in addressing individual subtasks using decision trees and simple statistical methods, which we applied to data from the selected company, and describes the results obtained.

Abstract. This work deals with recognizing similarity in the context of plagiarism in program code. To find out how demanding it is to create a perfect plagiarism, an experiment was carried out; it showed that creating such a plagiarism is relatively simple and does not require deep knowledge of the programming language. Based on this experiment, we designed the PerfectPlaggie tool, which can automatically create program clones in order to build a dataset for further testing. We then describe the anti-plagiarism system used at FIIT STU, which is unique in that it tries to make maximal use of standard Unix filters.


Contribution type: PhD Symposium

Keywords: similarity, source code, clone, perfect plagiarism, PerfectPlaggie


1 Introduction

Every day humanity creates an enormous amount of data; we are living in the era of so-called big data. However, a large share of these data is similar, sometimes even identical.

We recognize similarity, for example, for the purposes of personalized recommendation, spam identification, refactoring, plagiarism detection, or the detection of malicious code (malware).

Plagiarism detection is an open and important problem, because with the spread of the internet and the resulting free availability of large amounts of data, such as texts or source code, people are tempted to take the "easier path".

Plagiarism is defined as presenting someone else's ideas or texts as one's own.¹ Studies [1, 2] show that the problem of plagiarism in academia is even more serious than previously thought.

The role of automatic similarity recognition in the overall process of identifying plagiarism is highlighted in Figure 1.

¹ http://www.merriam-webster.com/dictionary/plagiarized

Fig. 1 Similarity in the process of searching for plagiarism


2 PerfectPlaggie 

PerfectPlaggie is the name of the experiment and, subsequently, also of the proposed tool. Both the experiment and the tool deal with creating a perfect plagiarism. By a perfect plagiarism we mean one that the available tools and methods are unable to flag as suspicious.

2.1 Experiment in manual creation of a program clone

This experiment examined how much effort and how much knowledge is needed to conceal plagiarism in program code. The tools SIM, JPlag [4], MOSS and Simian were chosen as the reference tools for evaluating the degree of similarity. The modifications were divided into three levels according to how difficult the changes are to carry out.

Fig. 2 Percentage of similarity found

Figure 2 shows the results: the degree of similarity found between the original and the modified version (clone, plagiarism). As can be seen, after third-level modifications none of the tested tools is able to reveal the similarity. According to [3, 5], all available tools and methods have problems when sophisticated types of obfuscation are used. The changes were made in a way that simulates how a computer would produce them, which means that they can also be generated automatically.

2.2 Automated creation of clones

Manually creating the large datasets that are needed is tedious and can lead to errors when the source code is modified. The advantage of a dataset created in this way is that we know for certain that the given pairs are clones, and we also know how these clones were created.

Figure 3 shows the architecture of the PerfectPlaggie tool, which is designed for the automatic creation of program clones. The tool is designed to download open-source projects from the internet automatically, select the interesting files from them (according to various metrics) and then modify these files (using various types of obfuscation). The output of the tool will be modified files derived from an original file. These files will have the same functionality as the original.


Fig. 3 Proposed architecture of the PerfectPlaggie tool (diagram labels: GitHub, crawler, user input)

The implementation of obfuscation also takes into account that suspicious files may be checked by a human expert. Therefore, when renaming identifiers, for example, we will draw on a synonym dictionary.
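To make the idea of synonym-based renaming concrete, here is a minimal sketch under stated assumptions. The synonym dictionary, the keyword list and the function name `rename_identifiers` are all hypothetical; the paper does not describe how PerfectPlaggie implements this step, and a real tool would need a proper parser so that string literals and comments are not rewritten.

```python
import re

# Hypothetical synonym dictionary; a real tool would draw on a larger thesaurus.
SYNONYMS = {"total": "sum_all", "count": "quantity", "index": "position"}

# Small, illustrative keyword list so that language keywords are never renamed.
KEYWORDS = {"int", "for", "return", "if", "else", "while"}


def rename_identifiers(source, synonyms=SYNONYMS):
    """Replace whole-word identifiers that have a synonym; leave everything else."""
    def swap(match):
        word = match.group(0)
        if word in KEYWORDS:
            return word
        return synonyms.get(word, word)

    # \b...\b matches whole identifiers only, so "index" inside "reindex" is untouched.
    return re.sub(r"\b[A-Za-z_]\w*\b", swap, source)


original = "int total = 0; for (int index = 0; index < count; index++) total += index;"
clone = rename_identifiers(original)
print(clone)
```

Because every occurrence of an identifier is renamed consistently, the clone keeps the behaviour of the original while its names still read naturally to a human reviewer.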


3 Plagiarism identification at FIIT

The plagiarism detection system used at FIIT STU differs from other systems in that it tries to use standard Unix filters as much as possible. The tool consists of two parts: assignment checking and coefficient normalization.

The assignment checking part compares pairs of files. Each check consists of several levels. The first level is a plain textual comparison of the files (diff); the second is a textual comparison that ignores blank lines and whitespace. The third level is the second level with every alphanumeric character replaced by the character X, which yields the "structure" of the program.

Fig. 4 Process of evaluating similarity for one level

The fourth level is the second level with comments removed and, in addition, each word placed on a separate line. The fifth level additionally sorts the lines of the fourth level alphabetically. The output of this part is a six-component vector containing information about the number of lines in the processed files, the number of changes needed to make the files identical, and the average size of a change.
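The normalization levels described above can be sketched as follows. This is an illustrative Python stand-in, not the FIIT system itself, which is built from Unix filters; the function names, the handling of C-style comments and the use of `difflib` to count changed lines are assumptions made for the example.

```python
import difflib
import re


def level2(text):
    """Drop blank lines and remove whitespace inside the remaining lines."""
    lines = ("".join(line.split()) for line in text.splitlines())
    return [line for line in lines if line]


def level3(text):
    """Level 2 with every alphanumeric character replaced by X (program 'structure')."""
    return [re.sub(r"[A-Za-z0-9]", "X", line) for line in level2(text)]


def level4(text):
    """Strip C-style comments (an assumption) and put each word on its own line."""
    no_comments = re.sub(r"/\*.*?\*/|//[^\n]*", "", text, flags=re.S)
    return no_comments.split()


def level5(text):
    """Level 4 with the lines sorted alphabetically."""
    return sorted(level4(text))


def changed_lines(a, b):
    """Number of lines that a line-based diff marks as removed or added."""
    return sum(1 for d in difflib.ndiff(a, b) if d.startswith(("- ", "+ ")))


original = "int sum = 0;\n\nsum = sum + value;\n"
renamed = "int  abc=0;\nabc = abc + value;\n"
reordered_a = "x = 1;\ny = 2;\n"
reordered_b = "y = 2;\nx = 1;\n"

# Level 1: the raw texts differ.
print(changed_lines(original.splitlines(), renamed.splitlines()))
# Level 3: renaming variables to names of equal length leaves the structure unchanged.
print(changed_lines(level3(original), level3(renamed)))
# Level 5: reordering statements disappears once the words are sorted.
print(changed_lines(level5(reordered_a), level5(reordered_b)))
```

Each level is more tolerant of one kind of disguise than the previous one, which is why the system reports the number of changes per level rather than a single score.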

The coefficient normalization part adjusts the vector from the previous part and, based on heuristically determined values, decides how suspicious the pair is.

Figure 4 shows the comparison of one pair: the results of one part flow into the next part (pipeline processing). The final output is a list of suspicious pairs.


4 Conclusion

The perfect plagiarism experiment showed that creating a perfect plagiarism is not difficult and that the effort put into creating it is lower than the effort that would be needed to program a solution to the assignment independently. Based on this experiment, we proposed the PerfectPlaggie tool, which serves to create large datasets that will be used for further testing of new tools and methods.

The anti-plagiarism system used at FIIT STU has an interesting architecture, but it needs further examination and testing before changes that would lead to even better results can be proposed.

Acknowledgement: This publication was created thanks to partial support from the project Adaptation of access to information and knowledge artifacts based on interaction and collaboration within the web environment, Scientific Grant Agency of MŠVVaŠ SR and SAV, grant No. VG 1/0646/15.

References

1. ARWIN, C., TAHAGHOGHI, S.M.M. Plagiarism Detection across Programming Languages. In Twenty-Ninth Australasian Computer Science Conference (ACSC2006). 2003. p. 10.

2. CHUDA, D. et al. The issue of (software) plagiarism: A student view. In IEEE Transactions on Education. 2012. Vol. 55, no. 1, pp. 22-28.

3. POTTHAST, M. et al. Overview of the 6th International Competition on Plagiarism Detection. In Notebook for PAN at CLEF 2014. 2014. pp. 845-876.

4. PRECHELT, L. et al. Finding Plagiarisms among a Set of Programs with JPlag. In Journal of Universal Computer Science. 2002. Vol. 8, no. 11, pp. 1016-1038.

5. TAHIR ALI, A.M. EL et al. Overview and comparison of plagiarism detection tools. In CEUR Workshop Proceedings. 2011. Vol. 706, pp. 161-172.


Annotation: 

Recognizing similarity of texts and program code.

This work deals with similarity in the context of plagiarism in source code. To find out how difficult it is to create a perfect plagiarism, we carried out an experiment which showed that creating such a plagiarism is quite simple and does not require much knowledge of the programming language. The PerfectPlaggie tool was designed on the basis of this experiment. It is capable of creating software clones in order to build large datasets automatically. We also describe the plagiarism detection system used at FIIT STU, which uses standard Unix filters to detect plagiarism.

Abstract. In this contribution we present a rule-mining system deployed as a cloud service and intended for big data analysis. The goal of the system is to analyse large volumes of data about various events using data aggregation, clustering, classification and prediction. The system consists of two components implemented as network services. The Predictor Generator provides a meaningful way of aggregating large amounts of data, and the Behavior Rule Extractor analyses these aggregations. The results of the system are prediction rules usable for decision support in areas such as management, marketing, customer segmentation, classification, behaviour prediction, etc.

Contribution type: PhD Symposium

Keywords: data aggregation, predictors, rule mining, cloud service


1 Introduction

Technological progress in recent decades has led to a growing need to process large amounts of data in various areas of practice. Big data [1] affect our everyday lives through applications that include healthcare systems, social systems, large scientific experiments, data repositories, logistics and delivery services, etc. The main problems of big data are their volume, their heterogeneous structure and the resulting difficulties with real-time processing. The amount of processed data is currently reaching the exascale level, where a single analytical system needs to perform more than 10^18 computations per second [2]. Clearly, this development requires new system architectures for data acquisition, transfer, storage and processing. A growing number of applications are starting to use layered system architectures and cloud infrastructure in order to cope with the requirements of processing this amount of data [2].

1.1 Description of ongoing research

In implementing the project [3], we focus on applied research and on the development of software solutions that make it possible to solve various big data analysis problems efficiently, including sales, marketing, personalization and recommendation, risk management, optimization of production processes, improvement of healthcare, etc. [3][4]. One of the outputs of the project will be a software platform for process analysis and optimization, provided as software as a service.

Our research team is divided into several groups that deal with, among other things, the analysis of large text data, aspect-based sentiment analysis in documents, the analysis of healthcare data, rule mining in processes and events, and personalized recommendations.

Within the last group we work on the analysis of customer behaviour based on processing process records (logs) and events (e.g. a visit to a web page, the display of an item in an e-shop, participation in an email campaign, etc.). The aim of the analysis is to obtain knowledge in the form of classification and prediction rules, to segment customers on the basis of similar behaviour, and to search for characteristic behaviour patterns.


1.2 Definition of the term predictor

When analysing logs, we assume semi-structured data in the form of an event table and a corresponding table of the entities (e.g. customers) responsible for these events. Each row in the tables represents one entity or one event, and several events may belong to each entity. We also assume that the amount of data in the event table grows continuously, which makes direct analysis of these data impossible. For this reason, we first aggregate the event data into newly created attributes in the entity table (Fig. 1) and only then analyse those attributes.



Fig. 1 Principle of generating aggregated attributes (predictors).

Since the number of functions that make this aggregation possible is theoretically infinite, our goal is to identify only those aggregations that are highly correlated with another (target) attribute in the entity table and can therefore be used when generating prediction rules. We call such correlated aggregated attributes predictors [5].
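The notion of a predictor can be illustrated with a minimal sketch that aggregates events per entity and ranks the aggregated attributes by their absolute Pearson correlation with a numeric target. The tables, the function names and the choice of aggregation functions are illustrative assumptions; the example uses only the Python standard library and does not reproduce the authors' implementation.

```python
from statistics import mean, pvariance

# Hypothetical event table: (customer_id, purchase_amount).
events = [(1, 10.0), (1, 30.0), (2, 5.0), (3, 50.0), (3, 70.0), (3, 60.0)]
# Hypothetical entity table reduced to its numeric target attribute.
target = {1: 2.0, 2: 1.0, 3: 6.0}

AGGREGATES = {
    "count": len, "sum": sum, "mean": mean,
    "max": max, "min": min, "variance": pvariance,
}


def aggregate(events):
    """Group event values by entity and apply every aggregation function."""
    grouped = {}
    for entity, value in events:
        grouped.setdefault(entity, []).append(value)
    return {name: {e: fn(vals) for e, vals in grouped.items()}
            for name, fn in AGGREGATES.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when one of the columns is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def rank_predictors(events, target):
    """Aggregated attributes ordered by decreasing |correlation| with the target."""
    ids = sorted(target)  # assumes every entity has at least one event
    scores = {}
    for name, column in aggregate(events).items():
        scores[name] = abs(pearson([column[i] for i in ids], [target[i] for i in ids]))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


for name, score in rank_predictors(events, target):
    print(f"{name}: {score:.3f}")
```

Only the top-ranked aggregates would be kept as predictors; the rest are discarded before any rule mining takes place.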

2 System overview

The proposed analytical system (specified in more detail in the conference paper [6]) consists of two main components realized as independent network services. The first is the Automatic Predictor Generator (APG), which prepares the predictors. These are then processed in the second component, provisionally named the Automatic Behavior Rule Extractor (ABRE), which generates decision trees over the entity table and extracts the most significant rules from them.

The APG algorithm can be described in words as follows:

1) The service accepts at least one entity table and one or more event tables.
2) It detects the data type of all attributes in these tables.
3) Together with the tables, the service is given a target attribute in the entity table, with respect to which new predictors are then generated.
4) The generation uses the functions count, sum, mean, maximum, minimum and variance for numeric attributes, and the functions count, number of unique values, number of occurrences of the most frequent value and name of the most frequent value for nominal attributes.
5) The newly generated attributes are then filtered using our own implementation of hierarchical agglomerative clustering (HAC) [7]. The clustering metric is the Pearson correlation coefficient for numeric attributes and the chi-square test of independence for nominal attributes. The result of the clustering is groups of predictors with low mutual correlation, with the predictors in each group ordered by decreasing correlation with the target attribute.
6) Finally, APG updates the entity table by adding a predefined number of the best predictors from the individual clusters.

The updated entity table can then be analysed using ABRE, whose algorithm is as follows: 1) The service accepts an entity table and the required target attribute. 2) With respect to the target attribute, a decision tree is generated using the standard recursive partitioning procedure [8], with the Gini index [9] as the selection criterion. 3) Finally, the rules with the highest coverage and confidence are extracted [10] from the decision tree; these rules are the result of the overall analysis and represent new knowledge that can be used to interpret the future behaviour of entities.
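The two quantities used by ABRE, the Gini index as a split criterion and the coverage and confidence of a rule, can be sketched as follows. The customer records and the function names are hypothetical; the sketch only illustrates the definitions and is not the Rapid Miner process used by the authors.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_of_split(rows, attribute, target):
    """Weighted Gini impurity of the partitions produced by a nominal attribute."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())


def rule_quality(rows, conditions, target, predicted):
    """Coverage (number of matching rows) and confidence (share classified correctly)."""
    covered = [r for r in rows if all(r[a] == v for a, v in conditions.items())]
    correct = sum(1 for r in covered if r[target] == predicted)
    confidence = correct / len(covered) if covered else 0.0
    return len(covered), confidence


# Hypothetical entity table with one predictor and the target attribute.
customers = [
    {"visits": "many", "segment": "loyal"},
    {"visits": "many", "segment": "loyal"},
    {"visits": "many", "segment": "casual"},
    {"visits": "few", "segment": "casual"},
]

print(gini([c["segment"] for c in customers]))
print(gini_of_split(customers, "visits", "segment"))
print(rule_quality(customers, {"visits": "many"}, "segment", "loyal"))
```

A path from the root of the tree to a leaf corresponds to one `conditions` dictionary, so ranking the leaves by coverage and confidence gives the rules that the service would report.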

2.1 Implementation notes on the APG and ABRE modules

The APG service module is implemented in Python, using the SciPy and NumPy packages to compute correlation coefficients. The generator of aggregated attributes is our own implementation, as is the HAC filter [7]. The communication interface of the service is based on the Flask micro-framework. The service can process messages in JSON format (and, to a limited extent, CSV files). The ABRE module is currently implemented as a process in the Rapid Miner Studio analytical tool and will be exported as a web service using Rapid Miner Server. Both services will be deployed with the Docker container technology in quasi-separated virtual environments. The containers for the APG module and for Rapid Miner Server are finished and have undergone preliminary testing. The implementation of ABRE is planned for the near future.

3 Preliminary results and discussion

Because the implementation of the system and of the supporting cloud infrastructure [3] is still in progress, it has not yet been possible to test the system as a whole; only the basic functionality of the individual modules has been verified. For this purpose we used a database of historical data about customers (entities) and the corresponding purchase events, consisting of 7 461 customer records and 10 002 event records. The customer table contained two applicable attributes, one of which was chosen as the target attribute of the resulting analysis. The event table contained 8 applicable attributes with a total of 80 016 values. Since two attributes were nominal and six numeric, the APG algorithm generated 2*4 + 6*6 = 44 unique aggregated attributes. These were then filtered to obtain the required three best predictors. The whole processing ran within the module on a single core of a 2 GHz CPU in less than 5 seconds. The resulting predictors (together with the remaining unused attribute of the customer table) were used in the ABRE module to generate 17 rules in approximately 3 seconds (again on a single core of a 2 GHz CPU). The best rule covered 6 581 examples, of which it classified 3 600 correctly. The second best rule covered 496 examples and classified 322 correctly.
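The figures reported above can be checked with a few lines of arithmetic. The helper names below are hypothetical; the constants come from the text (four aggregation functions per nominal attribute, six per numeric attribute) and the confidence is computed as correctly classified examples divided by covered examples.

```python
NOMINAL_FUNCS = 4  # count, unique values, top frequency, top value
NUMERIC_FUNCS = 6  # count, sum, mean, maximum, minimum, variance


def generated_attributes(nominal, numeric):
    """Number of aggregated attributes produced for the given attribute counts."""
    return nominal * NOMINAL_FUNCS + numeric * NUMERIC_FUNCS


def confidence(covered, correct):
    """Share of covered examples that a rule classifies correctly."""
    return correct / covered


print(generated_attributes(2, 6))             # 44
print(8 * 10002)                              # 80016 values in the event table
print(round(confidence(6581, 3600), 3))       # 0.547 for the best rule
print(round(confidence(496, 322), 3))         # 0.649 for the second best rule
```

The second rule is thus more reliable than the first even though it covers far fewer examples, which is the usual trade-off between coverage and confidence.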

These results confirm the basic functionality of the proposed cloud service concept, mainly in terms of performance. Naturally, our goal for the future is to improve the results of the analysis, which can be achieved in several ways. In the case of the APG module, other ways of generating new predictors can be considered in addition to the current aggregation functions. In the case of the ABRE module, the parameters used for generating the decision tree need to be tuned, or the rules could be generated directly. The main goal, however, is to finish implementing the interfaces between the cloud services and to deploy them on the operational cloud infrastructure [3].

4 Conclusion

In this article we presented the basic direction of the research project [3] carried out at our workplace and described one of its outputs [6]. To simplify the description, we specified a basic definition of the term predictor. We then described the algorithm of the APG service, which is able to aggregate relatively large amounts of dynamically generated data from process logs, and the algorithm of the ABRE service, which makes it possible to analyse the aggregated data in a meaningful way. Finally, we presented preliminary results of testing the modules that represent the individual services.

Annotation:

Automatic Generation of Predictors Used for Rule-Mining 

In this paper we present a proposal for a data-mining system deployed as a cloud service and intended for big data analysis. The main purpose of the system is the analysis of a vast number of event logs by means of data aggregation, clustering, classification and prediction. The system is composed of two components implemented as software services. The Automatic Predictor Generator provides a meaningful way to aggregate large amounts of data, and the Automatic Behavior Rule Extractor analyses these aggregations. The results of the system are prediction rules usable for decision support in areas such as management, marketing, customer segmentation, classification, behavior prediction, etc.

Abstract. Grammar representation offers useful features that can be used in aspects of computing other than standard language interpretation. One such aspect, addressed in this paper, is the representation of any meaningful written text in a single, non-redundant form. Such a form stores each distinct word separately and thus reduces the overall size of a document. Another size-reducing feature of the grammar form is its ability to abstract structure away from content. Using the application principle of lambda calculus, we can therefore create a supercombinator form of text substructures. When this form is applied to arguments, which are the words themselves, it produces the original text. We show that this form reduces the total number of language elements. We also show that supercombinators represent reusable language elements that can be used across analysed texts.

Contribution type: PhD Symposium

1 Introduction

Grammars are a widely used formalism, generally employed as a tool for language representation. But as Klint, Lammel and Verhoef pointed out in [4], they can be used for other purposes as well, for example data compression, structure representation or feature extraction. This paper addresses the representation of written natural language text in a compact, non-redundant form. Text can be compressed with context-free grammars, as Nevill-Manning and Witten did in [7] using the Sequitur algorithm. That form, however, leaves us with many repetitive structures and symbols. We aim to reduce the total number of elements used.

This work is related to the fields of formal grammar inference [1, 3, 10] and natural language induction [2, 8]. We build on such inferred grammars and store them in a non-redundant applicative supercombinator form. This form was devised in our previous work [5, 6], where we used only regular grammars as a basis for our experiments. Recently, we extended the process to context-free grammars, thus enabling the processing of inferred or induced grammars.

The main contributions of this paper are:

• We briefly present an algorithm for constructing a supercombinator set that can store any context-free grammar in a non-redundant applicative form. This is the topic of Section 2.

• We present the results in Section 3, where we compare the number of grammar elements of processed natural language text obtained from the Sequitur algorithm with our compressed supercombinator form of that Sequitur grammar. We show that the number of elements drops significantly, since our form is non-redundant. The original text can still be reconstructed from our form by simple function application.

2 Supercombinator Form and its Construction 

In this section, we show how a regular grammar can be represented in the form of an enriched lambda calculus, which is the basis of our supercombinator form. This enriched lambda calculus is extended with the meta-operations of the processed grammar.

Let us consider the simple regular expression (1). It generates either the sequence a b or zero to n repetitions of the symbol c.


$$a\,b \mid (c)^{*} \qquad (1)$$

We see that expression (1) contains three meta-operations: concatenation, alternation and Kleene star closure. In this case, the extended lambda calculus therefore contains not only the standard variable, lambda application and lambda abstraction operations, but also the meta-operations of the expression. Table 1 shows the complete set of supercombinators constructed from expression (1). The meta-operations appear in their bodies: concatenation is represented by the symbol +, alternation by the symbol | and the Kleene star, as usual, by ( )*. The second column contains the possible arguments of each supercombinator. As the first supercombinator is unary, the commas separating its arguments mean that only one of them may be used at a time. Arguments without commas represent a sequence of arguments. The main (or top) supercombinator is $L^3$; applying it to its arguments yields the original expression (1). The other supercombinators represent structure. Note that each supercombinator is constructed with just one meta-operation. Any regular grammar can be decomposed in this way.
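The decomposition described above can be mimicked with curried Python functions. This is only an illustrative sketch: the helper names `concat`, `alt` and `star` are assumptions that render the three meta-operations as text, and the four supercombinators follow the description of expression (1) rather than the authors' implementation.

```python
def concat(left, right):
    """Concatenation meta-operation, rendered as juxtaposition."""
    return f"{left} {right}"


def alt(left, right):
    """Alternation meta-operation."""
    return f"{left} | {right}"


def star(inner):
    """Kleene star meta-operation."""
    return f"({inner})*"


# Curried lambdas mirror the lambda-calculus notation; each body uses one meta-operation.
L0 = lambda x0: x0                                             # identity on a symbol
L1 = lambda x0: lambda x1: concat(L0(x0), L0(x1))              # sequence of two symbols
L2 = lambda x0: star(L0(x0))                                   # repetition of a symbol
L3 = lambda x0: lambda x1: lambda x2: alt(L1(x0)(x1), L2(x2))  # top supercombinator

# Applying the top supercombinator to the symbols reproduces expression (1).
print(L3("a")("b")("c"))
# The same structure can be reused with different symbols.
print(L3("x")("y")("z"))
```

The structure lives entirely in `L0` to `L3`, while the symbols are supplied as arguments, which is what allows one supercombinator to be shared by many rules.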


Tab. 1 Supercombinator form of expression (1).

Supercombinator                                                          Arguments
$L^0 = \lambda x_0.\ x_0$                                                a, b, c
$L^1 = \lambda x_0.\lambda x_1.\ L^0 x_0 + L^0 x_1$                      a b
$L^2 = \lambda x_0.\ (L^0 x_0)^{*}$                                      c
$L^3 = \lambda x_0.\lambda x_1.\lambda x_2.\ L^1 x_0 x_1 \mid L^2 x_2$   a b c

2.1 Context-free extension

What about CFGs? As we pointed out in [9], CFG non-terminals may be viewed as higher-order jumps into another expression. This idea is incorporated in our algorithm for obtaining supercombinators. Each rule of a CFG is represented by its own top supercombinator, and each call of a non-terminal inside a rule body is therefore just a call of that supercombinator with its own arguments. It is thus possible to construct the supercombinator form of any CFG, where the starting rule is represented by the top supercombinator. The arguments of this supercombinator are all possible terminal symbols, represented as a non-redundant list. The second non-redundancy effect is achieved by reusing supercombinators.

Table 1 shows different supercombinators, each of which represents some specific structure. The $L^0$ supercombinator represents the identity function and is used three times, for three different arguments, in the case of expression (1). The larger the grammar, the more supercombinators are reused. For example, the $L^1$ supercombinator represents a sequence of any two distinct symbols. It is reasonable to believe that such a supercombinator is used rather heavily in larger grammars. In the next section, we confirm this reasoning with an experiment.
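Why reuse grows with grammar size can be illustrated with a small sketch that abstracts each rule body to its "shape" and counts the distinct shapes. The canonicalization used here (replacing every symbol by the index of its first occurrence in the body) and the toy grammar are assumptions for illustration; the authors' construction algorithm may differ in its details.

```python
def shape(body):
    """Abstract a rule body to the pattern of its symbols, e.g. (the, lord) -> (0, 1)."""
    first_seen = {}
    return tuple(first_seen.setdefault(symbol, len(first_seen)) for symbol in body)


def shared_structures(rules):
    """Map each distinct shape to the names of the rules that share it."""
    shapes = {}
    for name, body in rules.items():
        shapes.setdefault(shape(body), []).append(name)
    return shapes


# Hypothetical Sequitur-like rules: each right-hand side is a sequence of symbols.
rules = {
    "R1": ("the", "lord"),
    "R2": ("of", "god"),
    "R3": ("R1", "R2"),
    "R4": ("verily", "verily"),
}

for pattern, names in shared_structures(rules).items():
    print(pattern, names)
print(len(rules), "rules ->", len(shared_structures(rules)), "structures")
```

Three of the four rules are sequences of two distinct symbols and therefore share one structure, while the rule that repeats a symbol needs its own; in a large grammar most binary rules collapse in the same way.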

3 Experimental Results 

For our supercombinator algorithm to work, we need a grammar. Written text is a plain sequence of words, from which we would obtain only two supercombinators (one for the identity function and one long top supercombinator representing a sequence of n symbols). We can, however, construct a simple CFG using the Sequitur algorithm [7]. Such a CFG generates only the original text; it does not offer any generalisation. For this experiment it is nevertheless adequate, since we want to show the reduction in the number of elements. Our sample is taken from the King James Bible; we use the entire New Testament.

Table 2 shows the number of Sequitur rules of a certain arity (i.e. how many terminals and non-terminals a rule has) compared with the operation arity of supercombinators (i.e. how many supercombinator calls act as operands of the n-ary lambda calculus meta-operation, in this case only concatenation; for example, both $L^1$ and $L^3$ from Table 1 are binary).


Tab. 2 Sequitur rules compared to Supercombinators, part 1.

Arity        0    2      3    4    5   6   7   8   9
Sequitur     1    13782  714  242  76  52  27  19  8
Supercom.    1    [?]    188  142  65  51  27  19  8

Arity        10   11     12   13   14  16  17  20  Total
Sequitur     6    4      2    1    3   2   1   1   14942
Supercom.    6    4      2    1    3   2   1   1   808

A rule with certain arity is always transformed to a supercombinator of the same operational arity. We see that the greatest reduction occurs in the lower arities, while going up from the arity of 7, there is no reduction at all. This is normal, since large arity rules are rather rare and they differ from each other in their structure.

There is also one supercombinator (and rule) not shown in the tables: the top one (which is also the starting rule), with an arity of 86461. It is, however, included in the total in Table 2. The total count shows a significant reduction in the number of elements. Remember that any distinct symbol (in this case a word) is stored in our form only once, so the reduction in elements is indeed substantial.
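As a worked check of the totals in Table 2, the relative reduction from grammar rules to supercombinators can be computed directly from the two reported counts (the rounding to three decimal places is ours):

```latex
\[
\frac{14942 - 808}{14942} = \frac{14134}{14942} \approx 0.946
\]
```

In other words, roughly 94.6 % of the Sequitur rules are absorbed by shared supercombinators.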

4 Conclusions 

We have briefly shown a way to represent a CFG in a non-redundant applicative supercombinator form. This form offers a significant reduction in the number of elements of Sequitur grammars obtained from text. We demonstrated this on a sample taken from the King James Bible: we generated a Sequitur grammar from the sample and then processed it with our algorithm. The reduction in the number of elements is indeed significant, as 14942 grammar rules are transformed into 808 supercombinators.

Abstract. One of the possible analyses of unstructured texts is topic modelling, which tries to uncover hidden thematic structures in these texts. Topic modelling can be useful mainly in the context of social networks, where it can serve for analysis during crisis situations, the launch of a new product on the market, etc. Several modifications of the classical approaches that build a hierarchy of topics have recently emerged. These often offer a more detailed analysis than the classical approaches. The aim of this article is therefore to present several possible approaches to hierarchical topic modelling based on formal concept analysis, which serves for the analysis of object-attribute models. The article also offers an experimental verification of one of the approaches on posts from the social network Twitter.

Contribution type: PhD Symposium

Keywords: topic modelling, social networks, formal concept analysis, data streams


1 Introduction

In recent years, social networks have become one of the most important means of communication. An enormous number of posts is produced on them every day; on the social network Twitter¹, for example, around 340 million posts are published daily. These posts often reflect the opinions and attitudes of users towards various products, people, events, etc.

Data from social networks are gaining ground mainly in marketing, where understanding and representing them correctly can bring a company a competitive advantage. They can be used, for example, in crisis analysis, by the media, when launching a new product on the market, etc.

¹ https://twitter.com/

Hierarchical topic modelling over data streams from social networks using FCA


As can be seen, the analysis of social network data has a wide range of uses. However, processing these data runs into several problems, the most important of which is the sheer quantity of published posts: manual analysis of such an amount of data would be very time-consuming. This activity therefore needs to be automated. One option is to use topic modelling methods, which have shown us a new way of summarizing, searching and browsing texts. The basic idea of topic modelling is to uncover hidden thematic structures among the input texts.

For this reason, this article focuses on experimental approaches to hierarchical topic modelling using formal concept analysis.

2 Formal Concept Analysis

Formal concept analysis (FCA) [1] is a data analysis method whose popularity has grown
mainly in recent years. It is used in many areas, such as data mining, information
retrieval, and association rule mining.

FCA can be used in a similar way to hierarchical clustering techniques. The result of
the analysis is the set of relationships found in the data, the so-called concept
lattice. A concept lattice represents a collection of formal concepts (a type of
concept hierarchy) that are hierarchically ordered.
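
To make the notion of a formal concept more concrete, the following minimal sketch enumerates all concepts of a small binary context by brute force. It is only an illustration, not the GOSCL algorithm used in this work: the context (posts described by terms), the function names, and the exhaustive closure strategy are all assumptions chosen for readability, and the approach is exponential in the number of objects.

```python
from itertools import combinations


def derive_attributes(context, objects):
    """Attributes shared by every object in `objects` (all attributes if empty)."""
    shared = set().union(*context.values())
    for obj in objects:
        shared &= context[obj]
    return frozenset(shared)


def derive_objects(context, attributes):
    """Objects that possess every attribute in `attributes`."""
    return frozenset(o for o, attrs in context.items() if attributes <= attrs)


def formal_concepts(context):
    """Return all (extent, intent) pairs by closing every subset of objects."""
    concepts = set()
    objects = list(context)
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = derive_attributes(context, subset)
            extent = derive_objects(context, intent)
            concepts.add((extent, intent))
    return concepts


# Hypothetical toy context: posts (objects) described by terms (attributes).
context = {
    "post1": {"flood", "rescue"},
    "post2": {"flood", "damage"},
    "post3": {"phone", "launch"},
}
concepts = formal_concepts(context)

# Print from the most general concept (largest extent) to the most specific.
for extent, intent in sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[0]))):
    print(sorted(extent), sorted(intent))
```

In this toy context the concept with extent {post1, post2} and intent {flood} sits between the top concept (all posts, no shared term) and the two single-post concepts below it, which is the kind of hierarchical ordering the concept lattice captures.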

Several methods exist for building concept lattices; some of them are presented and
compared in [2]. Another method, which is also used in our work, is the generalized
one-sided fuzzy concept lattice (GOSCL). GOSCL is a model of formal concept analysis
that uses one-sided fuzzification to generate the concept lattice. Its advantage is
that it can generate concepts from objects described by different types of attributes
(nominal, ordinal, numerical, and others). It is also an incremental algorithm, i.e. it
can be used to process data streams. More information about GOSCL can be found in [3].

3 Topic Modelling

Several approaches to topic modelling have been presented in recent years. One of the
most popular is Latent Dirichlet Allocation (LDA) [4], and several extensions of LDA
have been created [5, 6]. Several different approaches have also been introduced, e.g.
in [7]. At present, however, these models do not offer a sufficiently detailed
analysis, so models that create a hierarchy of topics are coming to the fore. Such
models include, for example, [8, 9].

4 Proposed Approaches

This section presents possible approaches to topic modelling by means of FCA. As
already mentioned, FCA looks for dependencies among the input data and creates a
hierarchical structure. The problem in applying it to topic modelling tasks, however,
is that FCA in its basic form cannot uncover hidden relationships among the input data;
it can only reveal dependencies that are already known.

We assume that this problem could be removed in two possible ways. The first is to use
FCA in combination with an external method that uncovers hidden structures (e.g. latent
semantic methods), with FCA serving only to build the hierarchy. If this approach is to
be used for processing data streams, however, the external method must be incremental.

The second possible approach is to modify FCA so that it can itself find hidden
structures among the input data, e.g. by applying probabilistic methods and combining
it with other machine learning algorithms.

The first approach was verified experimentally on a sample of 1000 posts from the
social network Twitter dealing with 4 main topics. The external method used together
with FCA was SVD (singular value decomposition) [10]. This approach was compared with
the classical approaches of topic modelling (LDA) and clustering (K-means). The quality
of the methods was compared on the basis of concept purity and the number of generated
concepts. As can be seen from Tab. 1, the proposed method achieves purity values
comparable with those of the classical approaches. The large number of concepts
produced by the proposed method is caused by the hierarchical structure it creates.
More about this approach will be available in [11].
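
Purity can be illustrated with a short sketch. The paper does not state how purity is computed for overlapping concepts, so the code below assumes the standard definition for disjoint clusters: each cluster is credited with the count of its most frequent true topic, and the sum is divided by the number of objects. The function name and the toy data are hypothetical.

```python
from collections import Counter


def purity(clusters, labels):
    """Standard clustering purity.

    clusters: list of lists of object ids (each object in exactly one cluster)
    labels:   dict mapping object id -> true topic label
    """
    total = sum(len(cluster) for cluster in clusters)
    # For every non-empty cluster, count the objects of its majority topic.
    majority = sum(
        Counter(labels[obj] for obj in cluster).most_common(1)[0][1]
        for cluster in clusters
        if cluster
    )
    return majority / total


# Hypothetical toy data: six posts with two true topics, grouped in two clusters.
labels = {1: "A", 2: "A", 3: "B", 4: "B", 5: "B", 6: "A"}
clusters = [[1, 2, 3], [4, 5, 6]]
print(purity(clusters, labels))  # 4 of 6 posts match their cluster's majority topic
```

A purity of 1.0 would mean that every cluster contains posts of a single topic only; note that purity tends to rise as the number of clusters grows, which is one reason the number of concepts is reported alongside it.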


Tab. 1 Comparison of standard methods with the presented approach (FCA-SVD) for various
settings of the cut-off (concepts covering a smaller percentage of objects than the
given threshold are removed) and of K (the K best singular values from SVD)

Method     Cut-off (%)   K (SVD)   Purity   Number of concepts
LDA        -             -         0.699    4
K-means    -             -         0.766    4
FCA-SVD    5             4         0.73     91
FCA-SVD    5             8         0.69     750
FCA-SVD    5             20        0.55     2942
FCA-SVD    10            4         0.72     68
FCA-SVD    10            8         0.67     455
FCA-SVD    10            20        0.49     784
FCA-SVD    0             4         0.74     185
FCA-SVD    0             8         0.72     1669
FCA-SVD    0             20        0.64     25383


5 Conclusion

This paper presented possible approaches to hierarchical topic modelling using formal
concept analysis. One of the approaches was also verified experimentally and achieved
results comparable with those of standard non-hierarchical approaches. In the future we
would like to extend the experiments to larger datasets and to find metrics for
comparing topic hierarchies that are more suitable than purity and the number of
clusters.
Annotation:

Hierarchical Topic Modeling on Streams of Social Networks Data Based on the Formal
Concept Analysis

One of the possible ways to analyse unstructured texts is topic modelling, which tries
to uncover hidden thematic structures in these texts. Topic modelling can be
particularly useful in the context of social networks, where it can be used for
analysis during crisis situations etc. Some extensions of topic modelling are
hierarchical modifications of conventional approaches, which offer deeper analysis than
classic topic modelling. The main aim of this paper is to describe new ways of
hierarchical topic modelling based on formal concept analysis. It also provides an
experimental evaluation of one proposed method.
Abstract. Complex object recognition is applied in many sectors. However, designing
recognition methods is difficult, which is why many approaches exist. One of them is
cluster-based symbolization, which shows interesting results in this area. The approach
is able to recognize events such as human motions or gestures. From a certain point of
view, the approach is similar to human neural networks: there are bigger and smaller
clusters that can be connected with other clusters via references. The clusters have
varying attributes, such as average cluster size or symbol dispersion in a cluster. In
this paper we review these attributes. To this end, we perform experiments using image
objects that represent letters of informal alphabets.

Contribution type: PhD Symposium 

Keywords: Complex object recognition, Cluster-based symbolization, Symbolization


1 Introduction 

The symbolization process plays a very important role in advanced object recognition
[2,6,7]. It allows us to encapsulate raw image data into abstract structures. Complex
events or objects are easier to analyze through abstract structures than through the
raw data. Therefore, we decided to focus on symbolization methods that allow complex
recognition. To store data we use a cluster approach. Cluster approaches may
potentially allow us to recognize unlearned data. These approaches have been used for
complex recognition (Takano [6], Zhou [8]) and had interesting results in this area.
However, our suggested method uses object vectorization instead of the hidden Markov
model to obtain symbols. Changing the symbolization method could change the properties
of the cluster system. Therefore, we mainly focus on cluster properties in the tests.
2 Implementation

We divided the implementation process into several parts:

1. The first part is image processing. Here we try to reduce image noise using the
image denoising method described in [1,5]. The method has input attributes that allow
us to control the level of denoising. These attributes also affect the execution speed
of the method, so it is necessary to choose the right trade-off between execution speed
and denoising level. No method achieves 100% noise reduction. In our case the residual
noise is located around object edges; it is reduced in the following steps.

2. The second part is image thresholding. The test images in the experiment have a
well-defined background, so we decided to use a method based on a single threshold
level [4]. The main problem is to find the ideal threshold value, and we have to take
the residual noise into account. Because our approach uses the object vectorization
process [3] for object description, the threshold value does not need to be defined
very precisely: the vectorization process is designed to tolerate a low level of noise
around object edges. We use the threshold value 64 for each monochromatic color
channel. Values higher than the threshold are filtered out.

3. The third part is the symbolization process. For this we use the method described in
[3], which is based on object vectorization. The method allows us to describe outer and
inner shapes. By vectorizing the shape edges it produces a string. As noted in the
second part, the method is able to reduce noise around edges. The level of noise
adaptation can be changed as required through one input attribute of the method.

4. The final part creates a system based on clusters. The previous step gives us an
object representation as a string, and in this part we want to store these strings in
clusters. For this we need a storing system. Based on the works of Takano [6] and Zhou
[8], we designed a system that measures differences between strings. Inserting a string
into a cluster requires a decision condition, which defines whether or not the string
will be inserted into the cluster. If the comparison were based on comparing all
characters of the strings, the system would become slow as it grows. Therefore, we
decided to calculate a hash value for each string. The decision condition compares
these hash values instead of all string characters. If no cluster satisfies the
decision condition, a new cluster is created. To control cluster size, we define a new
attribute, the deviation value. This value is the maximal difference between a string
hash and a cluster hash. The cluster hash value equals the string hash value of the
first inserted string. The hash function is:

$$f(S) = \sum_{i=1}^{n} (i+1)\, s(i) \qquad (1)$$
Here S stands for the input string, n is the number of symbols in this string, s(i)
represents the i-th symbol in this string, and (i+1) is an increment providing better
cluster diversity. If the increment (i+1) is excluded from the formula, the number of
clusters decreases and the number of strings in clusters increases.
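
The effect of the increment can be illustrated with a minimal sketch of the hash function. Two details are assumptions made only for this illustration: the numeric value s(i) of a symbol is taken to be its character code, and symbols are counted from 1. The parameter name `weighted` is hypothetical.

```python
def string_hash(symbols, weighted=True):
    """Hash of a symbol string: sum of (i + 1) * s(i) over its symbols.

    Assumptions: s(i) is the character code (ord) and i is counted from 1.
    With weighted=False the (i + 1) increment is dropped.
    """
    return sum(
        (i + 1) * ord(ch) if weighted else ord(ch)
        for i, ch in enumerate(symbols, start=1)
    )


# With the increment, the order of symbols influences the hash ...
print(string_hash("ab"), string_hash("ba"))
# ... without it, every permutation of the same symbols collides.
print(string_hash("ab", weighted=False), string_hash("ba", weighted=False))
```

Because strings that differ only in symbol order receive the same unweighted hash, they fall into the same cluster, which is consistent with the observation that dropping the increment yields fewer but larger clusters.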

Two strings are stored in the same cluster if the absolute difference of their hash
values is lower than or equal to the deviation value. However, if a cluster contains
two or more strings, the average of all string hash values in the cluster is calculated
and used for the comparison. The second method provides no redundancy in clusters. It
uses the following formula:

$$f(S_1, S_2) = \sum_{i=0}^{\max(m_1, m_2) - 1} \operatorname{abs}\bigl(s_1(i) - s_2(i)\bigr) \qquad (2)$$

Here max returns the maximum of the lengths $m_1$ and $m_2$ of the strings $S_1$ and
$S_2$, abs returns the absolute value of its argument, and the functions $s_1(i)$ and
$s_2(i)$ are defined in the range $i = 0, \dots, \max(m_1, m_2) - 1$ as follows:

$$s_1(i) = \begin{cases} s_1(i), & \text{if } m_1 \ge m_2, \text{ or } m_1 < m_2 \text{ and } i < m_1, \\ 0, & \text{if } m_1 < m_2 \text{ and } m_1 \le i \le m_2 - 1, \end{cases}$$

$$s_2(i) = \begin{cases} s_2(i), & \text{if } m_2 \ge m_1, \text{ or } m_2 < m_1 \text{ and } i < m_2, \\ 0, & \text{if } m_2 < m_1 \text{ and } m_2 \le i \le m_1 - 1. \end{cases}$$

3 Experimental results

We now present the results of our method on test data. In the tests we mainly focus on
these specific attributes of clusters:

1. Average cluster size - this attribute represents the average number of strings in a
cluster.

2. Average string dispersion - this attribute shows the average dispersion in clusters.
The ability to differentiate two strings from each other depends on the difference
between these strings. A higher difference means a better recognition ability. The
dispersion value shows this difference between strings in clusters.

3. Number of clusters - this attribute represents the actual number of clusters in the
system.

We can affect the tested attributes through the deviation value. We changed the
deviation value step by step and observed the changes in the tested attributes,
focusing mainly on the type of relation between the deviation value and each attribute.
Based on the type of relation we can estimate the future behavior of the system. In
general, two system properties are problematic: unambiguous recognition of symbols over
time and the system cost. The measured values form linear relations. This means that
the system cost will grow linearly with growing data, and that the average string
dispersion decreases linearly down to a point at which it is no longer acceptable for
unambiguous recognition of symbols. Because the relations are linear, the point of
potential ambiguity can be estimated well in advance. Experimental results are shown in
Fig. 1.
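
How the deviation value drives the three tested attributes can be sketched as follows. This is a simplified illustration under several assumptions, not the authors' implementation: symbol values are character codes, the string hash counts symbols from 1, a cluster is compared through the average hash of its strings, and the dispersion of a cluster is taken to be the mean pairwise zero-padded string difference described in Section 2. All function names and the toy strings are hypothetical.

```python
from itertools import combinations
from statistics import mean


def string_hash(symbols):
    # Hash function under the assumption s(i) = ord(symbol), i counted from 1.
    return sum((i + 1) * ord(ch) for i, ch in enumerate(symbols, start=1))


def string_difference(s1, s2):
    # Sum of absolute symbol differences; the shorter string is zero-padded.
    def code(s, i):
        return ord(s[i]) if i < len(s) else 0

    return sum(abs(code(s1, i) - code(s2, i))
               for i in range(max(len(s1), len(s2))))


def insert(clusters, string, deviation):
    """Add `string` to the first cluster whose average hash lies within
    `deviation` of the string hash; otherwise open a new cluster."""
    h = string_hash(string)
    for cluster in clusters:
        if abs(mean(string_hash(s) for s in cluster) - h) <= deviation:
            cluster.append(string)
            return
    clusters.append([string])


def cluster_attributes(strings, deviation):
    clusters = []
    for s in strings:
        insert(clusters, s, deviation)
    # Assumption: dispersion = mean pairwise difference inside a cluster;
    # clusters holding a single string are skipped.
    dispersions = [mean(string_difference(a, b) for a, b in combinations(c, 2))
                   for c in clusters if len(c) > 1]
    return {
        "number_of_clusters": len(clusters),
        "average_cluster_size": mean(len(c) for c in clusters),
        "average_dispersion": mean(dispersions) if dispersions else 0.0,
    }


strings = ["abc", "abd", "xyz", "abcd"]
for deviation in (0, 10, 1000):
    print(deviation, cluster_attributes(strings, deviation))
```

On this toy input, a larger deviation value merges strings into fewer and larger clusters, which is the qualitative dependence the experiments examine on real symbol strings.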
Fig. 1 All alphabets - Average cluster sizes


4 Conclusion 

In this paper we presented our approach to symbolization based on clusters. We
performed several experimental tests in order to determine the system properties. The
results, presented in Section 3, show potential for object recognition. However, this
area needs further research.

Acknowledgment: This work was supported by project KEGA 031TUKE-4/2016 "Integrating
software processes into the teaching of programming".
Abstract. Patients' health results obtained from various tests are often used in
medicine to determine symptoms and predict selected diseases. People suffering from
Parkinson's disease show several symptoms, but a typical one is a problem with
articulation and speech (dysphonia). In this paper we therefore focus on classifying
patients according to their speech signals using data mining methods (the Naive Bayes
classifier and decision trees: the C4.5, C5.0 and CART algorithms). The dataset we
worked with consists of voice measurements of 31 people, each of whom is represented in
the data by approximately 6 records. We first split the data into a training and a
testing set and, after building the models, calculated their accuracy from the values
in the contingency table. In addition, we used hypothesis tests to examine the
dependency of the binary target attribute on the other attributes.

Contribution type: PhD Symposium

Keywords: Parkinson's disease, speech, data mining, classification


1 Introduction

Parkinson's disease [1] is the second most common neurodegenerative disorder, right
after Alzheimer's disease. In developed countries it occurs in approximately 0.3% of
the population, and this percentage gradually rises with increasing age: among people
older than 60 it is approximately 1%, and after the age of 80 as much as 4% of people.
In terms of sex, it occurs more often in men, in a ratio of 3:1, which may be related
mainly to the protective effects of oestrogen in women.

The symptoms of the disease differ from one individual to another. In some people the
individual symptoms develop slowly, in others more quickly. Typical early symptoms
include, for example, tremor of the hands, arms and legs, but also slowing of movement,
muscle stiffness, and problems with speech [2]. At present, however, there is no
suitable treatment that could cure patients suffering from this disease completely.
They are helped, at least in part, by drugs that replace the missing dopamine, thanks
to which patients are kept in good condition.

In our research we focused on diagnosing Parkinson's disease using data transformed
from sound recordings into attributes that express the speech signal. This area is used
more and more, because systems based on speech signal processing are less costly and
easier to use [3]. The approach can support early diagnosis of the disease. The main
aim of the work was to find out what accuracy can be achieved by models built from
speech signal data, and then to assess their real usability in practice for classifying
patients.

2 Related Work

Parkinson's disease is very common and yet there is still no cure for it. Many
researchers therefore focus on this area. In [4], for example, the authors collected a
total of 1040 recordings from 40 people (26 samples per person), 20 of whom suffered
from the disease. The recordings contained the processed speech signal of sustained
vowels (a, o, u), numbers from 1 to 10, short sentences, and certain words. Besides
these attributes, the data contained for each record the UPDRS value, i.e. the Unified
Parkinson's Disease Rating Scale score determined by expert physicians. Their aim was
to find out which type of voice recording (sustained vowels, numbers, sentences)
predicts the disease better, and whether transforming multiple recordings of a patient
into central tendency and dispersion metrics gives better model results. The study
concluded that the highest accuracy was achieved by the sustained vowel "o" (72.5%) and
the word "four" (75%). They also found that representing multiple recordings of one
patient by central tendency and dispersion metrics (median, standard deviation,
interquartile range, and mean absolute deviation) improves the generalisation of the
predictive model. Representing a patient by the mean and standard deviation gave a
model accuracy of 82.14%.

Tracking disease progress by means of the UPDRS value was addressed by Tsanas et al.
[5]. From available data, which likewise contained speech signals, they tried to
predict the UPDRS value using various types of regression. When predicting the
numerical attribute Motor-UPDRS with the CART method, they achieved the lowest mean
absolute error (MAE) on the test data, with a value of 5.8.

3 Data Description and Modelling

The data we worked with were created by Max Little of the University of Oxford in
collaboration with the National Centre for Voice and Speech in Denver, Colorado, and
are freely available on the Internet in the UCI Machine Learning Repository [6]. The
whole dataset contained records of 31 patients, 23 of whom suffered from Parkinson's
disease [7]. In total, 165 records (rows) were available, because each patient had
several records in the data, which were treated independently of one another. The
target attribute was named status and contained the binary values 1/0, where 1 denotes
a diagnosis of Parkinson's disease. The data contained the following attributes:

• patient name and recording number (name),

• average fundamental vocal frequency (MDVP:Fo(Hz)),

• maximum fundamental vocal frequency (MDVP:Fhi(Hz)),

• minimum fundamental vocal frequency (MDVP:Flo(Hz)),

• measures of variation in fundamental frequency (MDVP:Jitter(%), MDVP:Jitter(Abs),
MDVP:RAP, MDVP:PPQ, Jitter:DDP),

• measures of variation in amplitude (MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3,
Shimmer:APQ5, MDVP:APQ, Shimmer:DDA),

• measures of the ratio of noise to tonal components in the voice (NHR, HNR),

• health status of the patient (status),

• nonlinear dynamical complexity measures (RPDE, D2),

• signal fractal scaling exponent (DFA),

• nonlinear measures of fundamental frequency variation (spread1, spread2, PPE).

Data loading, data preparation, and all experiments (examining dependencies between
attributes, building models) were carried out in the RStudio environment using the R
programming language.

To examine the dependency between the target attribute (status) and all the other
attributes, which were numerical, we used hypothesis testing. The hypotheses concerned
the similarity of the attribute means when the records are split by the binary target
attribute. We formulated a null and an alternative hypothesis:

• H0: The means of the two groups of patients are equal (the attributes are
independent).

• HA: There is a difference between the means of the patients (the attributes are
dependent).

To test these two hypotheses we used Welch's two-sample t-test, where we mainly examine
the p-value. The lower this value, the stronger the evidence for a dependency between
the target attribute and the chosen numerical attribute. In all cases we obtained a
very low p-value, so we can say that dependencies exist between the target attribute
and the other attributes. For example, for status and MDVP:Fhi the p-value was 0.028,
which means that H0 can be rejected and HA accepted with (1 - p) * 100 percent
confidence; in this case we reject H0 with at most 97.2% confidence.
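
The test statistic behind Welch's two-sample t-test can be sketched in a few lines. The sketch below computes only the t statistic and the Welch-Satterthwaite degrees of freedom; the p-value is then read from the t distribution with that many degrees of freedom (in R, the environment used here, `t.test` applies the Welch variant by default). The function name and the sample values are hypothetical and are not taken from the dataset.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means (unequal variances allowed).
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical values of one numeric attribute, split by the target attribute.
status_1 = [180.0, 195.5, 210.2, 175.3, 220.1]
status_0 = [150.4, 160.2, 155.8, 149.9]
t, df = welch_t(status_1, status_0)
print(round(t, 3), round(df, 2))
```

Unlike the classical Student's t-test, this variant does not assume equal variances in the two groups, which matters here because the groups of patients with status 1 and status 0 differ in size.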

We built the models for predicting Parkinson's disease using decision trees (the C4.5,
C5.0 and CART algorithms) and the Naive Bayes classifier. Before modelling, we split
the data into a training and a testing set in the ratios 70:30 and 80:20. The values
predicted by the models were compared with the values of the target attribute in the
testing set, from which we calculated the accuracy of the models using the formula
P = (TP + TN) / (TP + FP + TN + FN). This formula expresses the ratio of correctly
classified records to all records in the data. The results in Tab. 1 were obtained on
the testing set. The best results were achieved by the C4.5 algorithm, with an accuracy
of 91.43% for the 80/20 split. Note that with the 80/20 split more data were available
for training the model, which could therefore capture more patterns in the data and
classify the testing set better. The Naive Bayes classifier had the worst results for
both splits. Using a clustering method we achieved an accuracy of only 73.85%. This
method divided the records by the similarity of their attribute values into two
clusters (classes), which were then compared with the status of the patients
(1 - patients suffering from Parkinson's disease, 0 - healthy patients).
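
The accuracy computation can be illustrated with a small sketch that counts the four cells of the contingency table and divides the correctly classified records (true positives plus true negatives) by all records. The function names and the example vectors are hypothetical and do not come from the experiments.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of correctly classified records among all records."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical test-set labels (1 = Parkinson's disease, 0 = healthy).
actual = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 1, 1, 0, 1]
counts = confusion_counts(actual, predicted)
print(counts, accuracy(*counts))
```

Because 23 of the 31 patients in the dataset have the disease, accuracy alone can look favourable for a model that mostly predicts status 1, so the individual counts are worth inspecting as well.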
Tab. 1 Results achieved by the methods used

              Decision trees                   Naive Bayes
Data split    C4.5       C5.0       CART       classifier
70/30         90.90%     81.81%     86.36%     62.12%
80/20         91.43%     88.57%     82.86%     77.14%


4 Conclusion and Future Work

In this paper we described the first experiments and models for predicting Parkinson's
disease from data obtained from recordings of patients' speech. The results in Tab. 1
show that the decision tree method with the C4.5 algorithm achieved very good results,
with accuracy above 90% for both data splits.

In [8], the highest accuracy achieved on the same data was 76%, using the Support
Vector Machine method. We can therefore say that classifying patients according to the
transformed indicators (attributes) of their speech is possible with fair success.
Hypothesis testing also showed that all the attributes obtained from the speech
recordings were important for the resulting classification of a patient.

In future work we would like to stay with this type of disease and address further
challenges in this area. It is important not only to determine correctly whether a
patient suffers from Parkinson's disease, but also what stage the disease is in. The
hardest task will most likely be to identify the initial stage of the disease
correctly, where we assume that the individual speech indicators will not differ much
from those of healthy people. We also want to determine the stage of the disease by
calculating the UPDRS indicator, for example using various regression methods, since it
is a numerical attribute. In addition to the tabular data, the recordings from which
the individual attributes were obtained are available on the Internet. These attributes
were extracted from the patients' speech using the Praat Acoustic Analysis software,
which is freely available on the Internet. We would also like to try other software, or
obtain other indicators, that would allow us to build better predictive models with
higher accuracy.

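Acknowledgement. This publication was supported by the Scientific Grant Agency of the
Ministry of Education, Science, Research and Sport of the Slovak Republic (MSVVaS SR)
and the Slovak Academy of Sciences (SAV), project No. 1/0493/16.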
Annotation:

Parkinson's Disease Symptoms Prediction Using Speech Signals in Data Mining Methods

Health records of patients obtained from various testing methods are frequently used in
the medical field to determine symptoms and to predict the probability of selected
diseases. People suffering from Parkinson's disease show numerous symptoms; however,
dysphonia (changes in speech and articulation) is the most significant precursor. This
is why the article focuses on classifying patients based on their speech signals using
data mining methods (Naive Bayes classifier and decision trees: algorithms C4.5, C5.0
and CART). The dataset used in the article consists of voice measurements of 31
individuals, each represented by circa 6 records within the set. The dataset was first
split into training and testing sets, and the models were then built. The accuracy of
the values obtained by the models was calculated using the contingency table. In
addition, the dependency of the binary target attribute on the other attributes was
examined using hypothesis tests.



